Abstract

Kernel methods give powerful, flexible, and theoretically grounded
approaches to solving many problems in machine learning. The standard
approach, however, requires pairwise evaluations of a kernel function,
which can lead to scalability issues for very large datasets. Rahimi and
Recht (2007) suggested a popular approach to handling this problem, known
as random Fourier features. The quality of this approximation, however,
is not well understood. We improve the uniform error bound of that paper,
and give novel understandings of the embedding's variance, approximation
error, and use in some machine learning methods. We also point out that,
surprisingly, of the two main variants of those features, the more widely
used is strictly higher-variance for the Gaussian kernel and has worse
bounds.

1 INTRODUCTION 

Kernel methods provide an elegant, theoretically well-founded, and
powerful approach to solving many learning problems. Since traditional
algorithms require the computation of a full $N \times N$ pairwise kernel
matrix to solve learning problems on $N$ input instances, however, scaling
these methods to large-scale datasets containing more than thousands of
data points has proved challenging. Rahimi and Recht (2007) spurred
interest in one very attractive approach: approximating a continuous
shift-invariant kernel $k : \mathcal X \times \mathcal X \to \mathbb R$ by

\[ k(x, y) \approx z(x)^\top z(y) =: s(x, y), \]

where $z : \mathcal X \to \mathbb R^D$. Then primal methods in
$\mathbb R^D$ can be used, allowing most learning problems to be solved in
$O(N)$ time (e.g. Joachims 2006). Recent work has also exploited these
embeddings in some of the most-scalable kernel methods to date (Dai et al.
2014).


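Rahimi and Recht (2007) give two such embeddings, based on the Fourier
transform $P(\omega)$ of the kernel $k$: one of the form

\[
\tilde z(x) := \sqrt{\frac{2}{D}}
\begin{bmatrix}
\sin(\omega_1^\top x) \\ \cos(\omega_1^\top x) \\ \vdots \\
\sin(\omega_{D/2}^\top x) \\ \cos(\omega_{D/2}^\top x)
\end{bmatrix},
\qquad \omega_i \overset{iid}{\sim} P(\omega)
\tag{1}
\]

and another of the form

\[
\breve z(x) := \sqrt{\frac{2}{D}}
\begin{bmatrix}
\cos(\omega_1^\top x + b_1) \\ \vdots \\ \cos(\omega_D^\top x + b_D)
\end{bmatrix},
\qquad \omega_i \overset{iid}{\sim} P(\omega),
\quad b_i \overset{iid}{\sim} \mathrm{Unif}[0, 2\pi].
\tag{2}
\]

Bochner's theorem (1959) guarantees that for any continuous
positive-definite function $k(x - y)$, its Fourier transform will be a
nonnegative measure; if $k(0) = 1$, it will be properly normalized.
Letting $\tilde s$ be the reconstruction based on $\tilde z$ and
$\breve s$ that for $\breve z$, we have that: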


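\[ \tilde s(x, y) = \frac{2}{D} \sum_{i=1}^{D/2} \cos(\omega_i^\top (x - y)) \]

\[ \breve s(x, y) = \frac{1}{D} \sum_{i=1}^{D} \left[ \cos(\omega_i^\top (x - y)) + \cos(\omega_i^\top (x + y) + 2 b_i) \right]. \]

Letting $\Delta := x - y$, we have:

\[ \mathbb E \cos(\omega^\top \Delta) = \Re \int e^{i \omega^\top \Delta} \, dP(\omega) = \Re\, k(\Delta) \tag{3} \]

\[ \mathbb E_\omega \mathbb E_b \cos(\omega^\top (x + y) + 2b) = 0. \tag{4} \]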

Thus each $s(x, y)$ is a mean of bounded terms with expectation
$k(x, y)$. For a given embedding dimension $D$, it is not immediately
obvious which approximation is preferable: $\breve z$ of (2) gives twice
as many samples for $\omega$, but adds additional (non-shift-invariant)
noise. The academic literature seems split on the issue: of the first 100
papers citing Rahimi and Recht (2007) in a Google Scholar search, 15 used
either $\tilde z$ of (1) or the equivalent complex formulation, 14 used
$\breve z$, 28 did not specify, and the remainder did not use the
embedding. (None discussed that there was a choice.) Not included in the
count are Rahimi and Recht's later works (2008a; 2008b), which used
$\breve z$; indeed, post-publication revisions of the original paper only
discuss $\breve z$. Practically, we are aware of three implementations in
machine learning libraries, each of which uses $\breve z$ at the time of
writing: scikit-learn (Pedregosa et al. 2011), Shogun (Sonnenburg et al.
2010), and JSAT (Raff 2011-15).

We show that $\tilde z$, the embedding of (1), is superior for the popular
Gaussian kernel, and show how to decide which to use for other kernels.

The primary previous analyses of these embeddings, outside the one in the
original paper, have been by Rahimi and Recht (2008a), who bound the
increase in error of empirical risk estimates when learning models in the
induced RKHS, and by Yang et al. (2012), who compare the ability of the
Nyström and Fourier embeddings to exploit eigengaps in the learning
problem. We instead study the approximation directly, providing a
complementary view of the quality of these embeddings.

Section 2.1 studies the variance of each embedding, showing that which is
preferable depends on the kernel as well as the particular value of
$\Delta$, but for the popular Gaussian kernel $\tilde s$ is uniformly
lower-variance. Section 2.2 studies uniform convergence bounds, tightening
constants in the original $\tilde z$ bound and proving a comparable one
(with worse constants) for $\breve z$, bounding the expectation of the
maximal error, and providing exponential concentration about the mean.
Section 2.3 studies the $L_2$ convergence of each approximation;
$\tilde z$ is again superior for the Gaussian kernel. Section 3 discusses
the effect of this approximation error when used in various machine
learning methods. Section 4 evaluates the two embeddings and the bounds
empirically.

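2 APPROXIMATION ERROR

We will give various analyses of the error due to each approximation.

2.1 VARIANCE

(3) and (4) establish that $\mathbb E s(\Delta) = k(\Delta)$. What about
the variance? We have that

\[
\operatorname{Cov}(\tilde s(\Delta), \tilde s(\Delta'))
= \frac{2}{D} \operatorname{Cov}\left(\cos(\omega^\top \Delta), \cos(\omega^\top \Delta')\right)
= \frac{2}{D} \left[ \tfrac12 k(\Delta - \Delta') + \tfrac12 k(\Delta + \Delta') - k(\Delta) k(\Delta') \right]
\]

using $\cos(\alpha) \cos(\beta) = \tfrac12 \cos(\alpha + \beta) + \tfrac12 \cos(\alpha - \beta)$
and also $\mathbb E \cos(\omega^\top \Delta) = k(\Delta)$. Thus

\[ \operatorname{Var} \tilde s(\Delta) = \frac{1}{D} \left[ 1 + k(2\Delta) - 2 k(\Delta)^2 \right]. \tag{5} \]

Similarly, denoting $x + y$ by $t$,

\[
\operatorname{Cov}(\breve s(x, y), \breve s(x', y'))
= \frac{1}{D} \operatorname{Cov}\left( \cos(\omega^\top \Delta) + \cos(\omega^\top t + 2b),\;
  \cos(\omega^\top \Delta') + \cos(\omega^\top t' + 2b) \right)
\]
\[
= \frac{1}{D} \left[ \tfrac12 k(\Delta - \Delta') + \tfrac12 k(\Delta + \Delta') - k(\Delta) k(\Delta') + \tfrac12 k(t - t') \right]
\]

which gives

\[ \operatorname{Var} \breve s(x, y) = \frac{1}{D} \left[ 1 + \tfrac12 k(2\Delta) - k(\Delta)^2 \right]. \tag{6} \]

Thus $\tilde s$ has lower variance than $\breve s$ if

\[ \operatorname{Var} \cos(\omega^\top \Delta) = \tfrac12 + \tfrac12 k(2\Delta) - k(\Delta)^2 \le \tfrac12. \tag{7} \]

The Gaussian kernel $k(\Delta) = \exp\left(-\frac{\|\Delta\|^2}{2\sigma^2}\right)$ has

\[ \operatorname{Var} \cos(\omega^\top \Delta) = \tfrac12 \left( 1 - \exp\left(-\frac{\|\Delta\|^2}{\sigma^2}\right) \right)^2 \le \tfrac12, \]

so that $\tilde z$ is always lower-variance than $\breve z$, and the
difference in variance is greatest when $k(\Delta)$ is largest. This is
illustrated in Figure 1.

Figure 1: The variance per dimension of $\tilde s$ (blue) and $\breve s$
(orange) for the Gaussian RBF kernel (green).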
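The variance formulas can be checked by simulation. The sketch below assumes the readings $\operatorname{Var}\tilde s = \frac1D[1 + k(2\Delta) - 2k(\Delta)^2]$ for (5) and $\operatorname{Var}\breve s = \frac1D[1 + \frac12 k(2\Delta) - k(\Delta)^2]$ for (6), a Gaussian kernel with $\sigma = 1$ in $d = 1$, and arbitrary illustrative values of $x$, $y$, and $D$; the function names are hypothetical.

```python
import numpy as np

def k_gauss(r, sigma=1.0):
    """Gaussian kernel as a function of r = ||Delta||."""
    return np.exp(-r**2 / (2 * sigma**2))

def var_s_tilde(r, D, sigma=1.0):
    # Assumed form of equation (5).
    return (1 + k_gauss(2 * r, sigma) - 2 * k_gauss(r, sigma) ** 2) / D

def var_s_breve(r, D, sigma=1.0):
    # Assumed form of equation (6).
    return (1 + 0.5 * k_gauss(2 * r, sigma) - k_gauss(r, sigma) ** 2) / D

rng = np.random.default_rng(0)
D, reps = 10, 200_000
x, y = 0.7, -0.3
delta, t = x - y, x + y

# Each row is one independent draw of the whole embedding.
w = rng.normal(size=(reps, D // 2))
s_tilde = (2 / D) * np.cos(w * delta).sum(axis=1)

w = rng.normal(size=(reps, D))
b = rng.uniform(0, 2 * np.pi, size=(reps, D))
s_breve = (1 / D) * (np.cos(w * delta) + np.cos(w * t + 2 * b)).sum(axis=1)

emp_tilde, emp_breve = s_tilde.var(), s_breve.var()
print(emp_tilde, var_s_tilde(delta, D))
print(emp_breve, var_s_breve(delta, D))
```

For the Gaussian kernel the first formula never exceeds the second, which is the comparison expressed by (7).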

2.2 UNIFORM ERROR BOUND 

Let $f(x, y) := s(x, y) - k(x, y)$ denote the error of the approximation.
We will investigate $\|f\|_\infty$, i.e. the maximal approximation error
across the domain of $k$. We first consider the bound given by Rahimi and
Recht (2007), and then provide a new bound on $\mathbb E\|f\|_\infty$ and
its concentration around that mean.

2.2.1 Original High-Probability Bound

Claim 1 of Rahimi and Recht (2007) is that if
$\mathcal X \subset \mathbb R^d$ is compact with diameter $\ell$,¹

\[ \Pr\left( \|f\|_\infty \ge \varepsilon \right) \le 256 \left( \frac{\sigma_p \ell}{\varepsilon} \right)^2 \exp\left( -\frac{D \varepsilon^2}{8(d + 2)} \right) \]

where $\sigma_p^2 = \mathbb E[\omega^\top \omega] = \operatorname{tr} \nabla^2 k(0)$ depends on the kernel.

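It is not necessarily clear in that paper that this bound applies only to
the $\tilde z$ embedding; we can also tighten some constants. We first
state the tightened bound for $\tilde z$.

Proposition 1. Let $k$ be a continuous shift-invariant positive-definite
function $k(x, y) = k(\Delta)$ defined on $\mathcal X \subset \mathbb R^d$,
with $k(0) = 1$ and such that $\nabla^2 k(0)$ exists. Suppose $\mathcal X$
is compact, with diameter $\ell$. Denote $k$'s Fourier transform as
$P(\omega)$, which will be a probability distribution; let
$\sigma_p^2 = \mathbb E_p \|\omega\|^2$. Let $\tilde z$ be as in (1), and
define $\tilde f(x, y) := \tilde z(x)^\top \tilde z(y) - k(x, y)$. For any
$\varepsilon > 0$, let

\[ \alpha_\varepsilon := \min\left( 1, \; \sup_{x, y \in \mathcal X} \tfrac12 + \tfrac12 k(2x, 2y) - k(x, y)^2 + \tfrac13 \varepsilon \right), \]

\[ \beta_d := \left( \left(\tfrac{d}{2}\right)^{-\frac{d}{d+2}} + \left(\tfrac{d}{2}\right)^{\frac{2}{d+2}} \right) 2^{\frac{6d+2}{d+2}}. \]

Then, assuming only for the second statement that
$\varepsilon \le \sigma_p \ell$,

\[
\Pr\left( \|\tilde f\|_\infty \ge \varepsilon \right)
\le \beta_d \left( \frac{\sigma_p \ell}{\varepsilon} \right)^{\frac{2}{1 + 2/d}} \exp\left( -\frac{D \varepsilon^2}{8 (d + 2) \alpha_\varepsilon} \right)
\le 66 \left( \frac{\sigma_p \ell}{\varepsilon} \right)^2 \exp\left( -\frac{D \varepsilon^2}{8 (d + 2)} \right).
\]

Thus, we can achieve an embedding with pointwise error no more than
$\varepsilon$ with probability at least $1 - \delta$ as long as

\[ D \ge \frac{8 (d + 2) \alpha_\varepsilon}{\varepsilon^2} \left[ \frac{2}{1 + 2/d} \log \frac{\sigma_p \ell}{\varepsilon} + \log \frac{\beta_d}{\delta} \right]. \]
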

The proof strategy is very similar to that of Rahimi and Recht (2007):
place an $\varepsilon$-net with radius $r$ over
$\mathcal X_\Delta := \{x - y : x, y \in \mathcal X\}$, bound the error
$\tilde f$ by $\varepsilon/2$ at the centers of the net by Hoeffding's
inequality (1963), and bound the Lipschitz constant of $\tilde f$, which
is at most that of $\tilde s$, by $\varepsilon/(2r)$ with Markov's
inequality. The introduction of $\alpha_\varepsilon$ is by replacing
Hoeffding's inequality with that of Bernstein (1924) when it is tighter,
using the variance from (5). The constant $\beta_d$ is obtained by exactly
optimizing the value of $r$, rather than the algebraically simpler value
originally used; $\beta_{64} = 66$ is its maximum, and
$\lim_{d \to \infty} \beta_d = 64$, though it is lower for small $d$, as
shown in Figure 2. The additional hypothesis, that $\nabla^2 k(0)$ exists,
is equivalent to the existence of the first two moments of $P(\omega)$; a
finite first moment is used in the proof, and of course without a finite
second moment the bound is vacuous. The full proof is given in
Appendix A.1.

¹ Note that our D is half of the D in Rahimi and Recht (2007),
since we want to compare embeddings of the same dimension.

Figure 2: The coefficient $\beta_d$ of Proposition 1 (blue, for
$\tilde z$) and $\beta'_d$ of Proposition 2 (orange, for $\breve z$).
Rahimi and Recht (2007) used a constant of 256 for $\tilde z$.

For the Gaussian kernel, $\alpha_\varepsilon \le \tfrac12 + \tfrac13 \varepsilon$
and $\sigma_p^2 = d/\sigma^2$; the Bernstein bound is tighter when
$\varepsilon < \tfrac32$.
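The constants of Proposition 1 are easy to evaluate numerically. The sketch below assumes the readings $\beta_d = ((d/2)^{-d/(d+2)} + (d/2)^{2/(d+2)})\, 2^{(6d+2)/(d+2)}$ and $D \ge \frac{8(d+2)\alpha_\varepsilon}{\varepsilon^2} [\frac{2}{1+2/d} \log\frac{\sigma_p \ell}{\varepsilon} + \log\frac{\beta_d}{\delta}]$; the function names and the example values are hypothetical.

```python
import math

def beta(d):
    # Assumed form of the coefficient beta_d in Proposition 1.
    half = d / 2
    return (half ** (-d / (d + 2)) + half ** (2 / (d + 2))) * 2 ** ((6 * d + 2) / (d + 2))

def required_D_tilde(eps, delta, d, sigma_p, ell, alpha=1.0):
    # Sufficient embedding dimension from Proposition 1.
    # alpha = 1 is the Hoeffding case; any upper bound on alpha_eps is also valid.
    log_term = (2 / (1 + 2 / d)) * math.log(sigma_p * ell / eps) + math.log(beta(d) / delta)
    return 8 * (d + 2) * alpha / eps**2 * log_term

# Gaussian kernel with bandwidth sigma = 1 on a domain of diameter ell = 2.
d, eps, delta, ell = 10, 0.1, 0.01, 2.0
sigma_p = math.sqrt(d)                    # sigma_p^2 = d / sigma^2
alpha_gauss = min(1.0, 0.5 + eps / 3)     # Gaussian-kernel bound on alpha_eps

print(beta(1), beta(64), beta(10**6))
print(required_D_tilde(eps, delta, d, sigma_p, ell))
print(required_D_tilde(eps, delta, d, sigma_p, ell, alpha=alpha_gauss))
```

The Bernstein-based $\alpha_\varepsilon$ only rescales the requirement, so for small $\varepsilon$ it roughly halves the sufficient $D$ for the Gaussian kernel.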

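For $\breve z$, since the embedding $\breve s$ is not shift-invariant, we
must instead place the $\varepsilon$-net on $\mathcal X^2$. The additional
noise in $\breve s$ also increases the expected Lipschitz constant and
gives looser bounds on each term in the sum, though there are twice as
many such terms. The corresponding bound is as follows:

Proposition 2. Let $k$, $\mathcal X$, $\ell$, $P(\omega)$, and $\sigma_p$
be as in Proposition 1. Define $\breve z$ by (2), and
$\breve f(x, y) := \breve z(x)^\top \breve z(y) - k(x, y)$. For any
$\varepsilon > 0$, define

\[ \alpha'_\varepsilon := \min\left( 1, \; \sup_{x, y \in \mathcal X} \tfrac14 + \tfrac18 k(2x, 2y) - \tfrac14 k(x, y)^2 + \tfrac16 \varepsilon \right), \]

\[ \beta'_d := \left( d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}} \right) 2^{\frac{5d+1}{d+1}} \, 3^{\frac{d}{d+1}}. \]

Then, assuming only for the second statement that
$\varepsilon \le \sigma_p \ell$,

\[
\Pr\left( \|\breve f\|_\infty \ge \varepsilon \right)
\le \beta'_d \left( \frac{\sigma_p \ell}{\varepsilon} \right)^{\frac{2}{1 + 1/d}} \exp\left( -\frac{D \varepsilon^2}{32 (d + 1) \alpha'_\varepsilon} \right)
\le 98 \left( \frac{\sigma_p \ell}{\varepsilon} \right)^2 \exp\left( -\frac{D \varepsilon^2}{32 (d + 1)} \right).
\]

Thus, we can achieve an embedding with pointwise error no more than
$\varepsilon$ with probability at least $1 - \delta$ as long as

\[ D \ge \frac{32 (d + 1) \alpha'_\varepsilon}{\varepsilon^2} \left[ \frac{2}{1 + 1/d} \log \frac{\sigma_p \ell}{\varepsilon} + \log \frac{\beta'_d}{\delta} \right]. \]

$\beta'_{48} = 98$, and $\lim_{d \to \infty} \beta'_d = 96$, also shown in
Figure 2. The full proof is given in Appendix A.2.

For the Gaussian kernel, $\alpha'_\varepsilon \le \tfrac14 + \tfrac16 \varepsilon$,
so that the Bernstein bound is essentially always superior.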
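To compare the two propositions concretely, the sufficient dimensions can be evaluated side by side. This sketch assumes the readings of $\beta_d$, $\beta'_d$ and of the two lower bounds on $D$ given in Propositions 1 and 2 (with $\alpha_\varepsilon = \alpha'_\varepsilon = 1$, the Hoeffding case); the function names and example values are hypothetical.

```python
import math

def beta_tilde(d):
    # Assumed coefficient of Proposition 1.
    half = d / 2
    return (half ** (-d / (d + 2)) + half ** (2 / (d + 2))) * 2 ** ((6 * d + 2) / (d + 2))

def beta_breve(d):
    # Assumed coefficient of Proposition 2.
    return (d ** (-d / (d + 1)) + d ** (1 / (d + 1))) * 2 ** ((5 * d + 1) / (d + 1)) * 3 ** (d / (d + 1))

def required_D_tilde(eps, delta, d, sp_ell, alpha=1.0):
    log_term = (2 / (1 + 2 / d)) * math.log(sp_ell / eps) + math.log(beta_tilde(d) / delta)
    return 8 * (d + 2) * alpha / eps**2 * log_term

def required_D_breve(eps, delta, d, sp_ell, alpha=1.0):
    log_term = (2 / (1 + 1 / d)) * math.log(sp_ell / eps) + math.log(beta_breve(d) / delta)
    return 32 * (d + 1) * alpha / eps**2 * log_term

d, eps, delta = 5, 0.1, 0.05
sp_ell = math.sqrt(d) * 2.0     # sigma_p * ell for a unit-bandwidth Gaussian kernel, ell = 2

D_tilde = required_D_tilde(eps, delta, d, sp_ell)
D_breve = required_D_breve(eps, delta, d, sp_ell)
print(beta_breve(48), D_tilde, D_breve, D_breve / D_tilde)
```

In this example the phase-shifted embedding needs several times as many features for the same guarantee, reflecting the worse constants of Proposition 2.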


2.2.2 Expected Max Error 

Noting that $\mathbb E\|f\|_\infty = \int_0^\infty \Pr\left(\|f\|_\infty \ge \varepsilon\right) d\varepsilon$,
one could consider bounding $\mathbb E\|f\|_\infty$ via Propositions 1
and 2. Unfortunately, that integral diverges on $(0, \gamma)$ for any
$\gamma > 0$. If we instead integrate the minimum of that bound and 1,
the result depends on a solution to a transcendental equation, so
analytical manipulation is difficult.

We can, however, use a slight generalization of Dudley's entropy integral
(1967) to obtain the following bound:

Proposition 3. Let $k$, $\mathcal X$, $\ell$, and $P(\omega)$ be as in
Proposition 1. Define $\tilde z$ by (1), and
$\tilde f(x, y) := \tilde z(x)^\top \tilde z(y) - k(x, y)$. Let
$\mathcal X_\Delta := \{x - y \mid x, y \in \mathcal X\}$; suppose $k$ is
$L$-Lipschitz on $\mathcal X_\Delta$. Let
$R := \mathbb E \max_{i = 1, \ldots, D/2} \|\omega_i\|$. Then

\[ \mathbb E \|\tilde f\|_\infty \le \frac{24 \gamma \sqrt{d} \, \ell}{\sqrt{D}} (R + L) \]

where $\gamma \approx 0.964$.

The proof is given in Appendix A.3. In order to apply the method of Dudley
(1967), we must work around $\|\omega_i\|$ (which appears in the
covariance of the error process) being potentially unbounded. To do so,
we bound a process with truncated $\|\omega_i\|$, and then relate that
bound to $\tilde f$.

For the Gaussian kernel, $L = 1/(\sigma \sqrt{e})$ and²

\[
R \le \left( \sqrt{2} \, \frac{\Gamma((d + 1)/2)}{\Gamma(d/2)} + \sqrt{2 \log(D/2)} \right) / \sigma
\le \left( \sqrt{d} + \sqrt{2 \log(D/2)} \right) / \sigma.
\]

Thus $\mathbb E\|\tilde f\|_\infty$ is less than

\[ \frac{24 \gamma \sqrt{d} \, \ell}{\sigma \sqrt{D}} \left( e^{-1/2} + \sqrt{d} + \sqrt{2 \log(D/2)} \right). \tag{8} \]
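As a rough numerical illustration, the Gaussian-kernel bound can be evaluated directly. The sketch assumes (8) reads $\frac{24\gamma\sqrt{d}\,\ell}{\sigma\sqrt{D}} (e^{-1/2} + \sqrt{d} + \sqrt{2\log(D/2)})$ with $\gamma \approx 0.964$; the function name and the example domain are hypothetical.

```python
import math

GAMMA = 0.964  # approximate constant quoted in Proposition 3

def expected_max_error_bound(D, d, ell, sigma):
    # Assumed form of (8): Proposition 3 with the Gaussian-kernel values of L and R.
    lipschitz = math.exp(-0.5)                                  # sigma * L
    radius = math.sqrt(d) + math.sqrt(2 * math.log(D / 2))      # upper bound on sigma * R
    return 24 * GAMMA * math.sqrt(d) * ell / (sigma * math.sqrt(D)) * (lipschitz + radius)

# d = 1, interval [-3, 3] (diameter 6), unit bandwidth.
for D in (100, 1_000, 10_000, 1_000_000):
    print(D, expected_max_error_bound(D, d=1, ell=6.0, sigma=1.0))
```

Since $|\tilde f| \le 2$ always, values above 2 are vacuous; under these assumptions the bound is informative mainly for its scaling in $D$, $d$, and $\ell/\sigma$.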


2.2.3 Concentration About Mean 


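Bousquet's inequality (2002) can be used to show exponential
concentration of $\sup f$ about its mean.

We consider $\tilde f$ first. Let

\[ \tilde f_\omega(\Delta) := \frac{2}{D} \left( \cos(\omega^\top \Delta) - k(\Delta) \right), \]

so $\tilde f(\Delta) = \sum_{i=1}^{D/2} \tilde f_{\omega_i}(\Delta)$.
Define the "wimpy variance" of $\tilde f/2$ (which we use so that
$|\tilde f/2| \le 1$) as

\[
\sigma^2_{\tilde f/2} := \sup_{\Delta \in \mathcal X_\Delta} \sum_{i=1}^{D/2} \operatorname{Var}\left[ \tfrac12 \tilde f_{\omega_i}(\Delta) \right]
= \frac{1}{4D} \sup_{\Delta \in \mathcal X_\Delta} \left[ 1 + k(2\Delta) - 2 k(\Delta)^2 \right]
=: \frac{1}{4D} \sigma_w^2
\]

using (7). Clearly $1 \le \sigma_w^2 \le 2$; for the Gaussian kernel, it
is 1.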

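Proposition 5. Let $k$, $\mathcal X$, and $P(\omega)$ be as in
Proposition 1, and $\tilde z$ be defined by (1). Let
$\tilde f(\Delta) = \tilde z(x)^\top \tilde z(y) - k(\Delta)$ for
$\Delta = x - y$, and
$\sigma_w^2 := \sup_{\Delta \in \mathcal X_\Delta} 1 + k(2\Delta) - 2 k(\Delta)^2$.
Then

\[
\Pr\left( \|\tilde f\|_\infty - \mathbb E\|\tilde f\|_\infty \ge \varepsilon \right)
\le 2 \exp\left( -\frac{D \varepsilon^2}{D \, \mathbb E\|\tilde f\|_\infty + \tfrac12 \sigma_w^2 + \tfrac{D}{6} \varepsilon} \right).
\]
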

Proof. We use the Bernstein-style form of Theorem 12.5 
of Boucheron et al. (2013) on /(A)/2 to obtain that 
Pr (sup / — E sup / > e) is at most 


We can also prove an analogous bound for the $\breve z$ features:

Proposition 4. Let $k$, $\mathcal X$, $\ell$, and $P(\omega)$ be as in
Proposition 1. Define $\breve z$ by (2), and
$\breve f(x, y) := \breve z(x)^\top \breve z(y) - k(x, y)$. Suppose
$k(\Delta)$ is $L$-Lipschitz. Let
$R := \mathbb E \max_{i = 1, \ldots, D} \|\omega_i\|$. Then, for
$\mathcal X$ and $D$ not extremely small,


E 



< 


Vd 


{R + L) 


( 

exp-=—-— - 

Esup/+ia)/^ + | 

The same holds for $-\tilde f$, and
$\mathbb E \sup \tilde f \le \mathbb E\|\tilde f\|_\infty$,
$\mathbb E \sup(-\tilde f) \le \mathbb E\|\tilde f\|_\infty$. The claim
follows by a union bound. □

A bound on the lower tail, unfortunately, is not available in
the same form.



where $0.803 < \gamma' < 1.542$. See Appendix A.4 for details on $\gamma'$
and the "not extremely small" assumption.

The proof is given in Appendix A.4. It is similar to that for
Proposition 3, but the lack of shift invariance increases some constants
and otherwise slightly complicates matters. Note also that the $R$ of
Proposition 4 is somewhat larger than that of Proposition 3.


For /, note |/| < 3, so we use //3. Letting := 
T(cos(wTA) - k{A)), we have + 1 ). 

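Thus the same argument gives us:

Proposition 6. Let $k$ and $\mathcal X$ be as in Proposition 1, with
$P(\omega)$ defined as there. Let $\breve z$ be as in (2),
$\breve f(x, y) = \breve z(x)^\top \breve z(y) - k(x, y)$, and define
$\sigma_w^2$ as above. Then

$\Pr\left( \|\breve f\|_\infty - \mathbb E\|\breve f\|_\infty \ge \varepsilon \right)$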


² By the Gaussian concentration inequality (Boucheron et al. 2013,
Theorem 5.6), each $\|\omega\| - \mathbb E\|\omega\|$ is sub-Gaussian with
variance factor $\sigma^{-2}$; the claim follows from their Section 2.5.


< 2 exp 
Note that Proposition 6 actually gives a somewhat tighter concentration
than Proposition 5. This is most likely because, between the space of
possible errors being larger and the higher base variance illustrated in
Figure 1, the $\breve f$ error function has more "opportunities" to
achieve its maximal error. The experimental results (Figure 5) show that,
at least in one case, $\|\breve f\|_\infty$ does concentrate about its
mean more tightly, but that mean is enough higher than that of
$\|\tilde f\|_\infty$ that $\|\breve f\|_\infty$ stochastically dominates
$\|\tilde f\|_\infty$.


2.3 $L_2$ ERROR BOUND

$L_\infty$ bounds provide useful guarantees, but are very strict. It can
also be useful to consider a less stringent error measure. Let $\mu$ be a
$\sigma$-finite measure on $\mathcal X \times \mathcal X$; define

\[ \|f\|_\mu^2 := \int_{\mathcal X^2} f(x, y)^2 \, d\mu(x, y). \tag{9} \]

First, we have that

\[
\mathbb E \|\tilde f\|_\mu^2
= \mathbb E \int_{\mathcal X^2} \tilde f(x, y)^2 \, d\mu(x, y)
= \int_{\mathcal X^2} \mathbb E \tilde f(x, y)^2 \, d\mu(x, y)
\tag{10}
\]
\[
= \int_{\mathcal X^2} \frac{1}{D} \left[ 1 + k(2x, 2y) - 2 k(x, y)^2 \right] d\mu(x, y)
= \frac{1}{D} \left[ \mu(\mathcal X^2) + \int_{\mathcal X^2} k(2x, 2y) \, d\mu(x, y) - 2 \|k\|_\mu^2 \right],
\]
\[
\mathbb E \|\breve f\|_\mu^2
= \frac{1}{D} \left[ \mu(\mathcal X^2) + \frac12 \int_{\mathcal X^2} k(2x, 2y) \, d\mu(x, y) - \|k\|_\mu^2 \right],
\]

where (10) is justified by Tonelli's theorem.

If $\mu = \mu_X \times \mu_Y$ is a joint distribution of independent
variables, then
$\int_{\mathcal X^2} k(2x, 2y) \, d\mu(x, y) = \mathrm{MMK}(\mu_{2X}, \mu_{2Y})$,
where MMK is the mean map kernel (see Section 3.3). Likewise,
$\|k\|_\mu^2 = \mathrm{MMK}(\mu_X, \mu_Y)$ using the kernel $k^2$.³

Viewing $\|\tilde f\|_\mu^2$ as a function of
$\omega_1, \ldots, \omega_{D/2}$, changing $\omega_i$ to a different
$\hat\omega_i$ changes the value of $\|\tilde f\|_\mu^2$ by a bounded
amount; this can be seen by simple algebra and is shown in Appendix B.1.
Thus McDiarmid (1989) gives us an exponential concentration bound:

Proposition 7. Let $k$ be a continuous shift-invariant positive-definite
function $k(x, y) = k(\Delta)$ defined on
$\mathcal X \subseteq \mathbb R^d$, with $k(0) = 1$. Let $\mu$ be a
$\sigma$-finite measure on $\mathcal X^2$, and define $\|\cdot\|_\mu^2$ as
in (9). Define $\tilde z$ as in (1) and let
$\tilde f(x, y) = \tilde z(x)^\top \tilde z(y) - k(x, y)$. Let
$M := \mu(\mathcal X^2)$. Then

\[
\Pr\left( \left| \|\tilde f\|_\mu^2 - \mathbb E\|\tilde f\|_\mu^2 \right| \ge \varepsilon \right)
\le 2 \exp\left( \frac{-D^3 \varepsilon^2}{8 (4D + 1)^2 M^2} \right)
\le 2 \exp\left( \frac{-D \varepsilon^2}{200 M^2} \right).
\]

The second version of the bound is simpler, but somewhat looser for
$D \gg 1$; asymptotically, the coefficient of the denominator becomes 128.

Similarly, the variation of $\|\breve f\|_\mu^2$ is bounded (shown in
Appendix B.2). Thus:

Proposition 8. Let $k$, $\mu$, and $M$ be as in Proposition 7. Define
$\breve z$ as in (2) and let
$\breve f(x, y) = \breve z(x)^\top \breve z(y) - k(x, y)$. Then

\[
\Pr\left( \left| \|\breve f\|_\mu^2 - \mathbb E\|\breve f\|_\mu^2 \right| \ge \varepsilon \right)
\le 2 \exp\left( \frac{-D^3 \varepsilon^2}{\cdots} \right) \le \cdots
\]

The cost of a simpler dependence on $D$ is higher here; the asymptotic
coefficient of the denominator is 512.

³ $k^2$ is also a PSD kernel, by the Schur product theorem.

3 DOWNSTREAM ERROR

Rahimi and Recht (2008a; 2008b) give a bound on the $L_2$ distance between
any given function in the reproducing kernel Hilbert space (RKHS) induced
by $k$ and the closest function in the RKHS of $s$: results invaluable for
the study of learning rates. In some situations, however, it is useful to
consider not the learning-theoretic convergence of hypotheses to the
assumed "true" function, but rather directly consider the difference in
predictions due to using the $z$ embedding instead of the exact kernel
$k$.

3.1 KERNEL RIDGE REGRESSION



We first consider kernel ridge regression (KRR; Saunders et al. 1998).
Suppose we are given $n$ training pairs
$(x_i, y_i) \in \mathbb R^d \times \mathbb R$ as well as a regularization
parameter $\lambda = n \lambda_0 > 0$. We construct the training Gram
matrix $K$ by $K_{ij} = k(x_i, x_j)$. KRR gives predictions
$h(x) = \alpha^\top k_x$, where $\alpha = (K + \lambda I)^{-1} y$ and
$k_x$ is the vector with $i$th component $k(x_i, x)$.⁴ When using Fourier
features, one would not use $\alpha$, but instead a primal weight vector
$w$; still, it will be useful for us to analyze the situation in the dual.

Proposition 1 of Cortes et al. (2010) bounds the change in KRR predictions
from approximating the kernel matrix $K$ by $\hat K$, in terms of
$\|\hat K - K\|_2$. They assume, however, that the kernel evaluations at
test time $k_x$ are unapproximated, which is certainly not the case when
using Fourier features. We therefore extend their result to Proposition 9
before using it to analyze the performance of Fourier features.

⁴ If a bias term is desired, we can use $k'(x, x') = k(x, x') + 1$ by
appending a constant feature 1 to the embedding $z$. Because this change
is accounted for exactly, it affects the error analysis here only in that
we must use $\sup |k(x, y)| \le 2$, in which case the first factor of (11)
becomes $(\lambda_0 + 2)/\lambda_0^2$.
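Proposition 9. Given a training set $\{(x_i, y_i)\}_{i=1}^n$, with
$x_i \in \mathbb R^d$ and $y_i \in \mathbb R$, let $h(x)$ denote the
result of kernel ridge regression using the PSD training kernel matrix $K$
and test kernel values $k_x$. Let $\hat h(x)$ be the same using a PSD
approximation to the training kernel matrix $\hat K$ and test kernel
values $\hat k_x$. Further assume that the training labels are centered,
$\sum_{i=1}^n y_i = 0$, and let
$\sigma_y^2 := \frac1n \sum_{i=1}^n y_i^2$. Also suppose
$\|k_x\|_\infty \le \kappa$. Then:

\[
|\hat h(x) - h(x)| \le \frac{\sigma_y}{\sqrt{n} \lambda_0} \|\hat k_x - k_x\| + \frac{\kappa \sigma_y}{n \lambda_0^2} \|\hat K - K\|_2.
\]

Proof. Let $\alpha = (K + \lambda I)^{-1} y$,
$\hat\alpha = (\hat K + \lambda I)^{-1} y$. Thus, using
$\hat M^{-1} - M^{-1} = -\hat M^{-1} (\hat M - M) M^{-1}$, we have

\[ \hat\alpha - \alpha = -(\hat K + \lambda I)^{-1} (\hat K - K) (K + \lambda I)^{-1} y \]
\[
\|\hat\alpha - \alpha\| \le \|(\hat K + \lambda I)^{-1}\|_2 \, \|\hat K - K\|_2 \, \|(K + \lambda I)^{-1}\|_2 \, \|y\|
\le \frac{1}{\lambda^2} \|\hat K - K\|_2 \, \|y\|
\]

since the smallest eigenvalues of $K + \lambda I$ and
$\hat K + \lambda I$ are at least $\lambda$. Since
$\|k_x\| \le \sqrt{n} \kappa$ and $\|\hat\alpha\| \le \|y\| / \lambda$:

\[
|\hat h(x) - h(x)| = |\hat\alpha^\top \hat k_x - \alpha^\top k_x|
= |\hat\alpha^\top (\hat k_x - k_x) + (\hat\alpha - \alpha)^\top k_x|
\]
\[
\le \|\hat\alpha\| \, \|\hat k_x - k_x\| + \|\hat\alpha - \alpha\| \, \|k_x\|
\le \frac{\|y\|}{\lambda} \|\hat k_x - k_x\| + \frac{\sqrt{n} \kappa \|y\|}{\lambda^2} \|\hat K - K\|_2.
\]

The claim follows from $\lambda = n \lambda_0$,
$\|y\| = \sqrt{n} \sigma_y$. □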

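Suppose that, per the uniform error bounds of Section 2.2,
$\sup |k(x, y) - s(x, y)| \le \varepsilon$. Then
$\|\hat k_x - k_x\| \le \sqrt{n} \varepsilon$ and
$\|\hat K - K\|_2 \le \|\hat K - K\|_F \le n \varepsilon$, and
Proposition 9 (with $\kappa = 1$) gives

\[
|\hat h(x) - h(x)| \le \frac{\sigma_y}{\sqrt{n} \lambda_0} \sqrt{n} \varepsilon + \frac{\sigma_y}{n \lambda_0^2} n \varepsilon
= \frac{\lambda_0 + 1}{\lambda_0^2} \sigma_y \varepsilon.
\tag{11}
\]

Thus

\[
\Pr\left( |\hat h(x) - h(x)| \ge \varepsilon \right)
\le \Pr\left( \|f\|_\infty \ge \frac{\lambda_0^2 \varepsilon}{(\lambda_0 + 1) \sigma_y} \right),
\]

which we can bound with Proposition 1 or 2. We can therefore guarantee
$|h(x) - \hat h(x)| \le \varepsilon$ with probability at least
$1 - \delta$ if

\[
D = \Omega\left( d \left( \frac{(\lambda_0 + 1) \sigma_y}{\lambda_0^2 \varepsilon} \right)^2
\left[ \log \frac{1}{\delta} + \log \frac{(\lambda_0 + 1) \sigma_y \sigma_p \ell}{\lambda_0^2 \varepsilon} \right] \right).
\]

Note that this rate does not depend on $n$. If we want
$\hat h(x) \to h(x)$ at least as fast as $h(x)$'s convergence rate of
$O(1/\sqrt{n})$ (Bousquet and Elisseeff 2001), ignoring the logarithmic
terms, we thus need $D$ to be linear in $n$, matching the conclusion of
Rahimi and Recht (2008a).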
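The KRR analysis can be illustrated numerically. The following sketch, under assumed toy data and hyperparameters, compares exact dual KRR with KRR on the sin/cos features (solved both in the dual and in the primal) and evaluates the right-hand side of Proposition 9, read here as $\frac{\sigma_y}{\sqrt n \lambda_0} \|\hat k_x - k_x\| + \frac{\kappa \sigma_y}{n \lambda_0^2} \|\hat K - K\|_2$ with $\kappa = 1$. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D, sigma, lam0 = 200, 2, 500, 1.0, 0.1
lam = n * lam0

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
y -= y.mean()                          # Proposition 9 assumes centered labels
x_test = rng.normal(size=(1, d))

def gauss(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

omegas = rng.normal(scale=1 / sigma, size=(D // 2, d))

def z_tilde(A):
    P = A @ omegas.T
    return np.sqrt(2 / D) * np.hstack([np.sin(P), np.cos(P)])

K, k_x = gauss(X, X), gauss(X, x_test)[:, 0]
Z, z_x = z_tilde(X), z_tilde(x_test)[0]
K_hat, k_hat_x = Z @ Z.T, Z @ z_x

# Exact KRR and approximate KRR, both in the dual.
h = k_x @ np.linalg.solve(K + lam * np.eye(n), y)
h_hat_dual = k_hat_x @ np.linalg.solve(K_hat + lam * np.eye(n), y)

# The same approximate predictor via a primal weight vector w in R^D.
w = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
h_hat_primal = z_x @ w

sigma_y = np.sqrt(np.mean(y**2))
kappa = 1.0                            # Gaussian kernel values are at most 1
bound = (sigma_y / (np.sqrt(n) * lam0) * np.linalg.norm(k_hat_x - k_x)
         + kappa * sigma_y / (n * lam0**2) * np.linalg.norm(K_hat - K, 2))
print(abs(h_hat_dual - h), bound)
```

The primal solve costs $O(nD^2)$ rather than $O(n^3)$, which is the practical reason to use the features; the dual form is only needed for the analysis.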


3.2 SUPPORT VECTOR MACHINES 

Consider a Support Vector Machine (SVM) classifier with no offset, such
that $h(x) = w^\top \Phi(x)$ for a kernel embedding
$\Phi(x) : \mathcal X \to \mathcal H$ and $w$ is found by

\[ \operatorname*{argmin}_{w \in \mathcal H} \; \frac12 \|w\|^2 + \frac{C_0}{n} \sum_{i=1}^n \max\left( 0, 1 - y_i \langle w, \Phi(x_i) \rangle \right) \]

where $\{(x_i, y_i)\}_{i=1}^n$ is our training set with
$y_i \in \{-1, 1\}$, and the decision function is
$h(x) = \langle w, \Phi(x) \rangle$.⁵ For a given $x$, Cortes et al.
(2010) consider an embedding in $\mathcal H = \mathbb R^{n+1}$ which is
equivalent on the given set of points. They bound
$|\hat h(x) - h(x)|$ in terms of $\|\hat K - K\|_2$ in their
Proposition 2, but again assume that the test-time kernel values $k_x$ are
exact. We will again extend their result in Proposition 10:


Proposition 10. Given a training set $\{(x_i, y_i)\}_{i=1}^n$, with
$x_i \in \mathbb R^d$ and $y_i \in \{-1, 1\}$, let $h(x)$ denote the
decision function of an SVM classifier using the PSD training matrix $K$
and test kernel values $k_x$. Let $\hat h(x)$ be the same using a PSD
approximation to the training kernel matrix $\hat K$ and test kernel
values $\hat k_x$. Suppose $\sup k(x, x) \le \kappa$. Then:

\[
|\hat h(x) - h(x)|
\le \sqrt{2} \kappa^{3/4} C_0 \left( \|\hat K - K\|_2 + \|\hat k_x - k_x\| + |f_x| \right)^{1/4}
+ \sqrt{\kappa} C_0 \left( \|\hat K - K\|_2 + \|\hat k_x - k_x\| + |f_x| \right)^{1/2},
\]

where $f_x = \hat k(x, x) - k(x, x)$.


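Proof. Use the setup of Section 2.2 of Cortes et al. (2010). In
particular, we will use $\|w\| \le \sqrt{\kappa} C_0$ and their (16-17):

\[
\Phi(x_i) = K_x^{1/2} e_i, \qquad
\|\hat w - w\|^2 \le 2 C_0^2 \sqrt{\kappa} \, \|\hat K_x^{1/2} - K_x^{1/2}\|_2,
\]

where
$K_x := \begin{bmatrix} K & k_x \\ k_x^\top & k(x, x) \end{bmatrix}$
and $e_i$ is the $i$th standard basis vector.

Further, Lemma 1 of Cortes et al. (2010) says that
$\|\hat K_x^{1/2} - K_x^{1/2}\|_2 \le \|\hat K_x - K_x\|_2^{1/2}$. Let
$f_x := \hat k(x, x) - k(x, x)$. Then, by Weyl's inequality for singular
values,

\[
\left\| \begin{bmatrix} \hat K - K & \hat k_x - k_x \\ \hat k_x^\top - k_x^\top & f_x \end{bmatrix} \right\|_2
\le \|\hat K - K\|_2 + \|\hat k_x - k_x\| + |f_x|.
\]

Thus

\[
|\hat h(x) - h(x)| = \left| (\hat w - w)^\top \hat\Phi(x) + w^\top (\hat\Phi(x) - \Phi(x)) \right|
\]
\[
\le \|\hat w - w\| \, \|\hat\Phi(x)\| + \|w\| \, \|\hat\Phi(x) - \Phi(x)\|
\]
\[
\le \sqrt{2} \kappa^{3/4} C_0 \|\hat K_x^{1/2} - K_x^{1/2}\|_2^{1/2}
+ \sqrt{\kappa} C_0 \|(\hat K_x^{1/2} - K_x^{1/2}) e_{n+1}\|
\]
\[
\le \sqrt{2} \kappa^{3/4} C_0 \|\hat K_x - K_x\|_2^{1/4} + \sqrt{\kappa} C_0 \|\hat K_x - K_x\|_2^{1/2}
\]
\[
\le \sqrt{2} \kappa^{3/4} C_0 \left( \|\hat K - K\|_2 + \|\hat k_x - k_x\| + |f_x| \right)^{1/4}
+ \sqrt{\kappa} C_0 \left( \|\hat K - K\|_2 + \|\hat k_x - k_x\| + |f_x| \right)^{1/2}
\]

as claimed. □

⁵ We again assume there is no bias term for simplicity; adding a constant
feature again changes the analysis only in that it makes the $\kappa$ of
Proposition 10 2 instead of 1.
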

Suppose that $\sup |k(x, y) - s(x, y)| \le \varepsilon$. Then, as in the
last section, $\|\hat k_x - k_x\| \le \sqrt{n} \varepsilon$ and
$\|\hat K - K\|_2 \le n \varepsilon$. Then, letting $\gamma$ be 0 for
$\tilde z$ and 1 for $\breve z$, Proposition 10 (with $\kappa = 1$) gives

\[
|\hat h(x) - h(x)| \le \sqrt{2} C_0 (n + \sqrt{n} + \gamma)^{1/4} \varepsilon^{1/4}
+ C_0 (n + \sqrt{n} + \gamma)^{1/2} \varepsilon^{1/2}.
\]

Then $|\hat h(x) - h(x)| \ge u$ only if

\[
\varepsilon \ge \frac{2 C_0^2 + 4 C_0 u + u^2 - 2 (C_0 + u) \sqrt{C_0 (C_0 + 2u)}}{C_0^2 (n + \sqrt{n} + \gamma)}.
\]

This bound has the unfortunate property of requiring the approximation to
be more accurate as the training set size increases, and thus can prove
only a very loose upper bound on the number of features needed to achieve
a given approximation accuracy, due to the looseness of Proposition 10.
Analyses of generalization error in the induced RKHS, such as Rahimi and
Recht (2008a); Yang et al. (2012), are more useful in this case.

3.3 MAXIMUM MEAN DISCREPANCY 

Another area of application for random Fourier embeddings is to the mean
embedding of distributions, which uses some kernel $k$ to represent a
probability distribution $P$ in the RKHS induced by $k$ as
$\varphi(P) = \mathbb E_{x \sim P}[k(x, \cdot)]$. For samples
$\{X_i\}_{i=1}^n \sim P$ and $\{Y_j\}_{j=1}^m \sim Q$, we can estimate the
inner product in the embedding space, the mean map kernel (MMK), by

\[ \mathrm{MMK}(X, Y) := \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m k(X_i, Y_j) \approx \langle \varphi(P), \varphi(Q) \rangle. \]

The distance $\|\varphi(P) - \varphi(Q)\|$ is known as the maximum mean
discrepancy (MMD), which can be estimated with:

\[
\|\varphi(P) - \varphi(Q)\|^2
= \langle \varphi(P), \varphi(P) \rangle + \langle \varphi(Q), \varphi(Q) \rangle - 2 \langle \varphi(P), \varphi(Q) \rangle.
\]

$\mathrm{MMK}(X, X)$ is a biased estimator, because of the $k(X_i, X_i)$
and $k(Y_i, Y_i)$ terms; removing them gives an unbiased estimator
(Gretton et al. 2012). The MMK can be used in standard kernel methods to
perform learning on probability distributions, such as when images are
treated as sets of local patch descriptors (Muandet et al. 2012) or
documents as sets of word descriptors (Yoshikawa et al. 2014). The MMD has
strong applications to two-sample testing, where it serves as the
statistic for testing the hypothesis that $X$ and $Y$ are sampled from the
same distribution (Gretton et al. 2012); this has applications in, for
example, comparing microarray data from different experimental situations
or in matching attributes when merging databases.

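The MMK estimate can clearly be approximated with an explicit embedding:
if $k(x, y) \approx z(x)^\top z(y)$,

\[
\mathrm{MMK}_z(X, Y) = \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m z(X_i)^\top z(Y_j)
= \left( \frac1n \sum_{i=1}^n z(X_i) \right)^\top \left( \frac1m \sum_{j=1}^m z(Y_j) \right)
= \bar z(X)^\top \bar z(Y).
\]

Thus the biased estimator of $\mathrm{MMK}(X, X)$ is just
$\|\bar z(X)\|^2$; the unbiased estimator is

\[ \frac{n}{n - 1} \left( \|\bar z(X)\|^2 - \frac{1}{n^2} \sum_{i=1}^n \|z(X_i)\|^2 \right). \]

When $z(x)^\top z(x) = 1$, as with $\tilde z$, this simplifies to
$\frac{n}{n-1} \|\bar z(X)\|^2 - \frac{1}{n-1}$. When that is not
necessarily true, as with $\breve z$, that simplification holds only in
expectation.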
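These estimators reduce to a few vector operations once the features are computed. The sketch below uses the sin/cos embedding for a unit-bandwidth Gaussian kernel on toy samples; the function names and the data are hypothetical, and the unbiased estimator is written as $\frac{n}{n-1}(\|\bar z(X)\|^2 - \frac{1}{n^2}\sum_i \|z(X_i)\|^2)$.

```python
import numpy as np

def z_tilde(A, omegas):
    # sin/cos embedding: every row has unit norm, so z(x)^T z(x) = 1.
    D = 2 * omegas.shape[0]
    P = A @ omegas.T
    return np.sqrt(2 / D) * np.hstack([np.sin(P), np.cos(P)])

def mmk_embedded(ZX, ZY):
    # Biased MMK estimate: inner product of the two mean embeddings.
    return ZX.mean(axis=0) @ ZY.mean(axis=0)

def mmk_unbiased_self(ZX):
    # Unbiased estimate of MMK(X, X): drops the i = j terms of the double sum.
    n = ZX.shape[0]
    zbar = ZX.mean(axis=0)
    return n / (n - 1) * (zbar @ zbar - (ZX**2).sum() / n**2)

def mmd2_biased(ZX, ZY):
    # Biased squared MMD: squared distance between mean embeddings.
    diff = ZX.mean(axis=0) - ZY.mean(axis=0)
    return diff @ diff

rng = np.random.default_rng(0)
d, D = 2, 400
omegas = rng.normal(size=(D // 2, d))        # unit-bandwidth Gaussian kernel
X = rng.normal(size=(300, d))
Y = rng.normal(loc=0.5, size=(250, d))
ZX, ZY = z_tilde(X, omegas), z_tilde(Y, omegas)

print(mmk_embedded(ZX, ZY), mmk_unbiased_self(ZX), mmd2_biased(ZX, ZY))
```

Everything here is linear in the number of samples: the pairwise sum over $nm$ kernel evaluations is replaced by two means over $D$-dimensional vectors.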

This has been noticed a few times in the literature, e.g.
by Li and Tsang (2011). Gretton et al. (2012) give different
linear-time test statistics based on subsampling the
sum over pairs; this version avoids reducing the amount
of data used in favor of approximating the kernel. Additionally,
when using the MMK in a kernel method this
approximation allows the use of linear solvers, whereas the
other linear approximations must still perform some pair¬ 
wise computation. Zhao and Meng (2014) compare the 
empirical performance of an approximation equivalent to 
z against other linear-time approximations for two-sample 
testing. They find it is slower than the MMD-linear approximation but far
more accurate, while being more accurate and comparable in speed to a
block-based B-test (Zaremba et al. 2013).

Zhao and Meng (2014) also state a simple uniform error bound on the
quality of this approximation. Specifically, since
$|\mathrm{MMK}_z(X, Y) - \mathrm{MMK}(X, Y)|$ is at most the mean of
$|f(X_i, Y_j)|$, uniform error bounds on $f$ apply directly to
$\mathrm{MMK}_z$, including to the unbiased version of
$\mathrm{MMK}_z(X, X)$. Moreover, since
$\mathrm{MMD}^2(X, Y) = \mathrm{MMK}(X, X) + \mathrm{MMK}(Y, Y) - 2\,\mathrm{MMK}(X, Y)$,
its error is at most 4 times $\|f\|_\infty$. The advantage of this bound
is that it applies uniformly to all sample sets on the input space
$\mathcal X$, which is useful when we use MMK for a kernel method.


For a single two-sample test, however, we can get a tighter bound.
Consider $X$ and $Y$ fixed for now. Note that
$\mathbb E\,\mathrm{MMK}_z(X, Y) = \mathrm{MMK}(X, Y)$, by linearity of
expectation. The variance of $\mathrm{MMK}_z(X, Y)$ is exactly

\[ \frac{1}{n^2 m^2} \sum_{i, j} \sum_{i', j'} \operatorname{Cov}\left( s(X_i, Y_j), s(X_{i'}, Y_{j'}) \right), \tag{12} \]

which can be evaluated using the formulas of Section 2.1 and so, viewed
only as a function of $D$, is $O(1/D)$. Alternatively, we can use a
bounded difference approach: viewing $\mathrm{MMK}_{\tilde z}(X, Y)$ as a
function of the $\omega_l$s, changing $\omega_l$ to $\hat\omega_l$ changes
the MMK estimate by

\[
\left| \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m \frac{2}{D} \left( \cos(\omega_l^\top (X_i - Y_j)) - \cos(\hat\omega_l^\top (X_i - Y_j)) \right) \right|,
\]

which is at most $4/D$. The bound for $\breve z$ is in fact the same here.
Thus McDiarmid's inequality tells us that for fixed sets $X$ and $Y$ and
either $z$,

\[ \Pr\left( |\mathrm{MMK}_z(X, Y) - \mathrm{MMK}(X, Y)| \ge \varepsilon \right) \le 2 \exp\left( -\tfrac18 D \varepsilon^2 \right). \]

Thus $\mathbb E\,|\mathrm{MMK}_z(X, Y) - \mathrm{MMK}(X, Y)| \le 2 \sqrt{2\pi/D}$.
Similarly, $\mathrm{MMD}_z^2$ can be changed by at most $16/D$, giving

\[ \Pr\left( |\mathrm{MMD}_z^2(X, Y) - \mathrm{MMD}^2(X, Y)| \ge \varepsilon \right) \le 2 \exp\left( -\tfrac{1}{128} D \varepsilon^2 \right) \]

and expected absolute error of at most $8 \sqrt{2\pi/D}$.

Now, if we consider the distributions $P$ and $Q$ to be fixed but the
sample sets random, Theorems 7 and 10 of Gretton et al. (2012) give
exponential convergence bounds for the biased and unbiased population
estimators of MMD, which can easily be combined with the above bounds.
Note that this approach allows the domain $\mathcal X$ to be unbounded,
unlike the other bound. One could extend this to a bound uniform over some
smoothness class of distributions using the techniques of Section 2.2,
though we do not do so here.

4 NUMERICAL EVALUATION

4.1 APPROXIMATION ON AN INTERVAL

We first conduct a detailed study of the approximations on the interval
$\mathcal X = [-b, b]$. Specifically, we evenly spaced 1000 points on
$[-5, 5]$ and approximated the kernel matrix using both embeddings at
$D \in \{50, 100, 200, \ldots, 900, 1000, 2000, \ldots, 9000, 10000\}$,
repeating each trial 1000 times, estimating $\|f\|_\infty$ and
$\|f\|_\mu$ at those points. We do not consider $d > 1$ here, because
obtaining a reliable estimate of $\sup |f|$ becomes very computationally
expensive even for $d = 2$.

Figure 3 shows the behavior of $\mathbb E\|f\|_\infty$ as $b$ increases
for various values of $D$. As expected, the $\tilde z$ embeddings have
almost no error near 0. The error increases out to one or two bandwidths,
after which the curve appears approximately linear in $\ell/\sigma$, as
predicted by Proposition 3.

Figure 3: The maximum error within a given radius in $\mathbb R$, averaged
over 1000 evaluations. Solid lines represent $\tilde z$ and dashed lines
$\breve z$; black is $D = 50$, blue is $D = 100$, red $D = 500$, and cyan
$D = 1000$.

Figure 4 fixes $b = 3$ and shows the expected maximal error as a function
of $D$. It also plots the expected error obtained by numerically
integrating the bounds of Propositions 1 and 2 (using the minimum of 1 and
the bound). We can see that all of the bounds are fairly loose, but that
the first version of the bound in the propositions (with $\beta_d$, the
exponent depending on $d$, and $\alpha_\varepsilon$) is substantially
tighter than the second version when $d = 1$.

The bounds on $\mathbb E\|f\|_\infty$ of Propositions 3 and 4 are
unfortunately too loose to show on the same plot. However, one important
property does hold. For a fixed $\mathcal X$, (8) predicts that
$\mathbb E\|f\|_\infty = O(1/\sqrt{D})$. This holds empirically:
performing linear regression of $\log \mathbb E\|\tilde f\|_\infty$
against $\log D$ yields a model of
$\mathbb E\|\tilde f\|_\infty = e^c D^m$, with a 95% confidence interval
for $m$ of $[-0.502, -0.496]$; $\|\breve f\|_\infty$ gives
$[-0.503, -0.497]$. The integrated bounds of Propositions 1 and 2 do not
fit the scaling as a function of $D$ nearly as well.

Figure 5 shows the empirical survival function of the max error for
$D = 500$, along with the bounds of Propositions 1 and 2 and those of
Propositions 5 and 6 using the empirical mean. The latter bounds are
tighter than the former for low $\varepsilon$, especially for low $D$, but
have a lower slope.

Figure 4: $\mathbb E\|f\|_\infty$ for the Gaussian kernel on $[-3, 3]$
with $\sigma = 1$, based on the mean of 1000 evaluations and on numerical
integration of the bounds from Propositions 1 and 2. ("Tight" refers to
the bound with constants depending on $d$, and "loose" the second version;
"old" is the version from Rahimi and Recht (2007).)

The mean of the mean squared error, on the other hand, exactly follows the
expectation of Section 2.3 using $\mu$ as the uniform distribution on
$\mathcal X^2$: in this case,
$\mathbb E\|\tilde f\|_\mu^2 \approx 0.66/D$,
$\mathbb E\|\breve f\|_\mu^2 \approx 0.83/D$. (This is natural, as the
expectation is exact.) Convergence to that mean, however, is substantially
faster than guaranteed by the McDiarmid bound of Propositions 7 and 8. We
omit the plot due to space constraints.
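The two constants can be reproduced by one-dimensional numerical integration. The sketch assumes the expectations of Section 2.3 in the form $D\,\mathbb E\|\tilde f\|_\mu^2 = 1 + \mathbb E_\mu k(2\Delta) - 2\,\mathbb E_\mu k(\Delta)^2$ and $D\,\mathbb E\|\breve f\|_\mu^2 = 1 + \frac12 \mathbb E_\mu k(2\Delta) - \mathbb E_\mu k(\Delta)^2$, with $\mu$ uniform on $[-3, 3]^2$ and $\sigma = 1$; it uses the fact that $\Delta = x - y$ then has a triangular density on $[-6, 6]$. Variable names are illustrative.

```python
import numpy as np

b, sigma = 3.0, 1.0
m = 200_000
step = 4 * b / m
delta = -2 * b + (np.arange(m) + 0.5) * step         # midpoints on [-2b, 2b]
density = (2 * b - np.abs(delta)) / (2 * b) ** 2     # density of x - y for x, y ~ Unif[-b, b]

def k(t):
    return np.exp(-t**2 / (2 * sigma**2))

# Midpoint-rule integrals of k(2 Delta) and k(Delta)^2 against the density.
mean_k2delta = np.sum(k(2 * delta) * density) * step
mean_ksq = np.sum(k(delta) ** 2 * density) * step

coef_tilde = 1 + mean_k2delta - 2 * mean_ksq         # approximately 0.66
coef_breve = 1 + 0.5 * mean_k2delta - mean_ksq       # approximately 0.83
print(coef_tilde, coef_breve)
```

Under these assumptions the computed coefficients agree with the values 0.66 and 0.83 quoted above to two decimal places.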

4.2 MAXIMUM MEAN DISCREPANCY 

We now turn to the problem of computing the MMD
with a Fourier embedding. Specifically, we consider
the problem of distinguishing the standard normal distribution
$\mathcal N(0, I_2)$ from the two-dimensional mixture
0.95A/'(0, 12 ) + 0.05A/'(0, |72)- We take fixed sample sets 
$X$ and $Y$ each of size 1000 and compute the biased MMD
estimate with varying $D$ for both $\tilde z$ and $\breve z$; we used a
Gaussian kernel of bandwidth 1. The mean absolute errors of
the resulting estimates are shown in Figure 6. $\tilde z$ performs
mildly better than $\breve z$.

Again, the McDiarmid bound of Section 3.3 predicts that the mean absolute
error decays as $O(1/\sqrt{D})$, but with too high a multiplicative
constant; the 95% confidence interval for the exponent of $D$ is
$[-0.515, -0.468]$ for $\tilde z$ and $[-0.520, -0.486]$ for $\breve z$.
We also know that the expected root mean squared error decays like
$O(1/\sqrt{D})$ via (12).

Figure 5: $\Pr(\|f\|_\infty > \varepsilon)$ for the Gaussian kernel on
$[-3, 3]$ with $\sigma = 1$ and $D = 500$, based on 1000 evaluations
(black), numerical integration of the bounds from Propositions 1 and 2
(same colors as Figure 4), and the bounds of Propositions 5 and 6 using
the empirical mean (yellow).

Figure 6: Mean absolute error of the biased estimator for
$\mathrm{MMD}(X, Y)$, based on 100 evaluations.

5 DISCUSSION

We provide a novel investigation of the approximation error of the popular
random Fourier features, tightening existing bounds and showing new ones,
including an analytic bound on $\mathbb E\|f\|_\infty$ and exponential
concentration about its mean, as well as an exact form for
$\mathbb E\|f\|_\mu^2$ and exponential concentration in that case as well.
We also extend previous results on the change in learned models due to
kernel approximation. We verify some aspects of these bounds empirically
for the Gaussian kernel. We also point out that, of the two embeddings
provided by Rahimi and Recht (2007), the $\tilde z$ embedding (with half
as many sampled frequencies, but no additional noise due to phase shifts)
is superior in the most common case of the Gaussian kernel.
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A PROOFS FOR UNIFORM ERROR BOUND (SECTION 2.2) 


A.1 PROOF OF PROPOSITION I 

The proof strategy closely follows that of Rahimi and Recht (2007); we fill in some (important) details, tightening some 
parts of the proof as we go. 

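Let $\mathcal X_\Delta = \{x - y \mid x, y \in \mathcal X\}$. It's compact, with diameter at most $2\ell$, so we can find an $\varepsilon$-net covering $\mathcal X_\Delta$ with at most
$T = (4\ell/r)^d$ balls of radius $r$ (Cucker and Smale 2001, Proposition 5). Let $\{\Delta_i\}_{i=1}^T$ denote their centers, and $L_f$ be the
Lipschitz constant of $\tilde f$. If $|\tilde f(\Delta_i)| < \varepsilon/2$ for all $i$ and $L_f < \varepsilon/(2r)$, then $|\tilde f(\Delta)| < \varepsilon$ for all $\Delta \in \mathcal X_\Delta$.

Let $\tilde z_i(x) := \begin{bmatrix} \sin(\omega_i^T x) & \cos(\omega_i^T x) \end{bmatrix}^T$, so that $\tilde z(x)^T \tilde z(y) = \frac{2}{D} \sum_{i=1}^{D/2} \tilde z_i(x)^T \tilde z_i(y)$.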
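The decomposition into per-frequency blocks can be checked numerically. The following minimal sketch (the function name `z_tilde`, the dimensions, and the Gaussian frequency distribution are assumptions for illustration) builds the sin/cos embedding and verifies that its inner product equals the average of $\cos(\omega_i^T (x - y))$ over the $D/2$ sampled frequencies, which is why the approximation depends on $x$ and $y$ only through $\Delta = x - y$.

```python
import math
import random

def z_tilde(x, omegas):
    """sqrt(2/D) * [sin(w_1.x), cos(w_1.x), ..., sin(w_{D/2}.x), cos(w_{D/2}.x)]."""
    D = 2 * len(omegas)
    scale = math.sqrt(2.0 / D)
    feats = []
    for w in omegas:
        p = sum(wi * xi for wi, xi in zip(w, x))
        feats.extend([scale * math.sin(p), scale * math.cos(p)])
    return feats

rng = random.Random(1)
d, half_D = 3, 8                      # illustrative sizes (assumptions)
omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(half_D)]
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
y = [rng.uniform(-1.0, 1.0) for _ in range(d)]

# Inner product of the two embeddings.
inner = sum(a * b for a, b in zip(z_tilde(x, omegas), z_tilde(y, omegas)))

# (2/D) * sum_i cos(w_i . (x - y)), using sin(a)sin(b) + cos(a)cos(b) = cos(a - b).
delta = [a - b for a, b in zip(x, y)]
mean_cos = sum(math.cos(sum(wi * di for wi, di in zip(w, delta))) for w in omegas) / half_D
print(inner, mean_cos)
```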


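A.1.1 Regularity Condition

We will first need to establish that $\mathbb E \nabla \tilde s(\Delta) = \nabla \mathbb E \tilde s(\Delta) = \nabla k(\Delta)$. This can be proved via the following form of the
Leibniz rule, quoted verbatim from Cheng (2013):

Theorem (Cheng 2013, Theorem 2). Let $X$ be an open subset of $\mathbb R$, and $\Omega$ be a measure space. Suppose $f : X \times \Omega \to \mathbb R$
satisfies the following conditions:

1. $f(x, \omega)$ is a Lebesgue-integrable function of $\omega$ for each $x \in X$.

2. For almost all $\omega \in \Omega$, the derivative $\frac{\partial f(x, \omega)}{\partial x}$ exists for all $x \in X$.

3. There is an integrable function $\Theta : \Omega \to \mathbb R$ such that $\left| \frac{\partial f(x, \omega)}{\partial x} \right| \le \Theta(\omega)$ for all $x \in X$.

Then for all $x \in X$,

$$\frac{d}{dx} \int_\Omega f(x, \omega)\, d\omega = \int_\Omega \frac{\partial}{\partial x} f(x, \omega)\, d\omega.$$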


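Define the function $\tilde g^i_{x,y}(t, \omega) : \mathbb R \times \Omega \to \mathbb R$ by $\tilde g^i_{x,y}(t, \omega) = \tilde s(x + t e_i, y)$, where $e_i$ is the $i$th standard basis vector, and
$\omega$ is the tuple of all the $\omega_i$ used in $\tilde z$. $\tilde g^i_{x,y}(t, \cdot)$ is Lebesgue integrable in $\omega$, since

$$\int_\Omega \tilde g^i_{x,y}(t, \omega)\, d\omega = \mathbb E\, \tilde s(x + t e_i, y) = k(x + t e_i, y) < \infty.$$

For any $\omega \in \Omega$, $\frac{\partial}{\partial t} \tilde g^i_{x,y}(t, \omega)$ exists, and satisfies:

$$\begin{aligned}
\mathbb E_\omega \left| \frac{\partial}{\partial t} \tilde g^i_{x,y}(t, \omega) \right|
&= \mathbb E_\omega \left| \frac{2}{D} \sum_{j=1}^{D/2} \sin(\omega_j^T y) \frac{\partial}{\partial t} \sin(\omega_j^T x + t \omega_{ji}) + \cos(\omega_j^T y) \frac{\partial}{\partial t} \cos(\omega_j^T x + t \omega_{ji}) \right| \\
&= \mathbb E_\omega \left| \frac{2}{D} \sum_{j=1}^{D/2} \omega_{ji} \sin(\omega_j^T y) \cos(\omega_j^T x + t \omega_{ji}) - \omega_{ji} \cos(\omega_j^T y) \sin(\omega_j^T x + t \omega_{ji}) \right| \\
&\le \mathbb E_\omega \left[ \frac{2}{D} \sum_{j=1}^{D/2} \left| \omega_{ji} \sin(\omega_j^T y) \cos(\omega_j^T x + t \omega_{ji}) \right| + \left| \omega_{ji} \cos(\omega_j^T y) \sin(\omega_j^T x + t \omega_{ji}) \right| \right] \\
&\le \mathbb E_\omega \left[ \frac{2}{D} \sum_{j=1}^{D/2} 2 |\omega_{ji}| \right] \\
&\le 2\, \mathbb E_\omega \|\omega\|,
\end{aligned}$$

which is finite since the first moment of $\omega$ is assumed to exist.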

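Thus we have $\frac{\partial}{\partial x_i} \mathbb E \tilde s(x, y) = \mathbb E \frac{\partial}{\partial x_i} \tilde s(x, y)$. The same holds for $y$ by symmetry. Combining the results for each component,
we get as desired that $\mathbb E \nabla_\Delta \tilde s(x, y) = \nabla_\Delta \mathbb E \tilde s(x, y)$.

A.1.2 Lipschitz Constant

Since $\tilde f$ is differentiable, $L_f = \|\nabla \tilde f(\Delta^*)\|$, where $\Delta^* = \operatorname{argmax}_{\Delta \in \mathcal X_\Delta} \|\nabla \tilde f(\Delta)\|$.
Via Jensen's inequality, $\mathbb E \|\nabla \tilde s(\Delta)\| \ge \|\mathbb E \nabla \tilde s(\Delta)\|$. Now, letting $\Delta^* = x^* - y^*$:

$$\begin{aligned}
\mathbb E[L_f^2] &= \mathbb E\left[ \|\nabla \tilde s(\Delta^*) - \nabla k(\Delta^*)\|^2 \right] \\
&= \mathbb E_{\Delta^*}\left[ \mathbb E\left[ \|\nabla \tilde s(\Delta^*)\|^2 \right] - 2 \|\nabla k(\Delta^*)\|\, \mathbb E\left[ \|\nabla \tilde s(\Delta^*)\| \right] + \|\nabla k(\Delta^*)\|^2 \right] \\
&\le \mathbb E_{\Delta^*}\left[ \mathbb E\left[ \|\nabla \tilde s(\Delta^*)\|^2 \right] - 2 \|\nabla k(\Delta^*)\|^2 + \|\nabla k(\Delta^*)\|^2 \right] \\
&= \mathbb E \|\nabla \tilde s(\Delta^*)\|^2 - \mathbb E_{\Delta^*} \|\nabla k(\Delta^*)\|^2 \\
&\le \mathbb E \|\nabla \tilde s(\Delta^*)\|^2 \\
&= \mathbb E \left\| \nabla \tilde z(x^*)^T \tilde z(y^*) \right\|^2 \\
&= \mathbb E \left\| \nabla \frac{2}{D} \sum_{i=1}^{D/2} \tilde z_i(x^*)^T \tilde z_i(y^*) \right\|^2 \\
&= \mathbb E \left\| \nabla \tilde z_i(x^*)^T \tilde z_i(y^*) \right\|^2 \\
&= \mathbb E \left\| \nabla \cos(\omega^T \Delta^*) \right\|^2 \\
&= \mathbb E \left\| -\sin(\omega^T \Delta^*)\, \omega \right\|^2 \\
&= \mathbb E \left[ \sin^2(\omega^T \Delta^*) \|\omega\|^2 \right] \\
&\le \mathbb E \|\omega\|^2 = \sigma_p^2.
\end{aligned}$$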

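We can thus use Markov's inequality:

$$\Pr\left( L_f \ge \frac{\varepsilon}{2r} \right) = \Pr\left( L_f^2 \ge \left( \frac{\varepsilon}{2r} \right)^2 \right) \le \sigma_p^2 \left( \frac{2r}{\varepsilon} \right)^2. \tag{13}$$


A.1.3 Anchor Points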


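For any fixed $\Delta = x - y$, $\tilde s(\Delta)$ is a mean of $D/2$ terms with expectation $k(x, y)$ bounded by $\pm 1$. Applying Hoeffding's
inequality and a union bound:

$$\Pr\left( \bigcup_{i=1}^T |\tilde f(\Delta_i)| > \tfrac12 \varepsilon \right) \le T \Pr\left( |\tilde f(\Delta)| > \tfrac12 \varepsilon \right) \le 2T \exp\left( -\frac{D \varepsilon^2}{16} \right).$$

Since we know the variance of each term from (5), we could alternatively use Bernstein's inequality:

$$T \Pr\left( |\tilde f(\Delta)| > \tfrac12 \varepsilon \right) \le 2T \exp\left( -\frac{\frac{D}{2} \cdot \frac{\varepsilon^2}{4}}{2 \operatorname{Var}[\cos(\omega^T \Delta)] + \frac23 \varepsilon} \right) = 2T \exp\left( -\frac{D \varepsilon^2}{16 \left( \operatorname{Var}[\cos(\omega^T \Delta)] + \frac13 \varepsilon \right)} \right).$$

This is a better bound when $\operatorname{Var}[\cos(\omega^T \Delta)] + \frac13 \varepsilon < 1$; for the RBF kernel, this is true whenever $\varepsilon < \frac32$, and the improvement
is bigger when $k(\Delta)$ is large or $\varepsilon$ is small.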
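To make the comparison concrete, the following minimal sketch evaluates both exponents for a Gaussian kernel of bandwidth 1, using $\operatorname{Var}[\cos(\omega^T \Delta)] = \frac12 + \frac12 k(2\Delta) - k(\Delta)^2$. The helper names and the values of $D$, $\varepsilon$ and $\|\Delta\|$ are assumptions chosen for illustration. For the Gaussian kernel $k(2\Delta) = k(\Delta)^4$, so the variance equals $\frac12 (1 - k(\Delta)^2)^2 \le \frac12$, which is why Bernstein's inequality wins whenever $\varepsilon < \frac32$.

```python
import math

def gaussian_k(t, sigma=1.0):
    """Gaussian kernel as a function of the distance t = ||Delta||."""
    return math.exp(-t * t / (2.0 * sigma ** 2))

def var_cos(t, sigma=1.0):
    """Var[cos(w^T Delta)] = 1/2 + k(2 Delta)/2 - k(Delta)^2."""
    return 0.5 + 0.5 * gaussian_k(2.0 * t, sigma) - gaussian_k(t, sigma) ** 2

def hoeffding_exponent(D, eps):
    """Exponent in 2T exp(-D eps^2 / 16)."""
    return D * eps ** 2 / 16.0

def bernstein_exponent(D, eps, t, sigma=1.0):
    """Exponent in 2T exp(-D eps^2 / (16 (Var + eps/3)))."""
    return D * eps ** 2 / (16.0 * (var_cos(t, sigma) + eps / 3.0))

D, eps = 500, 0.1                     # illustrative values (assumptions)
distances = (0.1, 0.5, 1.0, 3.0)
for t in distances:
    print(f"||Delta||={t:3.1f}  Var={var_cos(t):.4f}  "
          f"Bernstein={bernstein_exponent(D, eps, t):9.2f}  "
          f"Hoeffding={hoeffding_exponent(D, eps):.2f}")
```

A larger exponent means a smaller failure probability; the gap is widest for small $\|\Delta\|$, where $k(\Delta)$ is close to 1 and the variance is close to 0.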

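To unify the two, let $\alpha_\varepsilon := \min\left( 1, \max_{\Delta \in \mathcal X_\Delta} \frac12 + \frac12 k(2\Delta) - k(\Delta)^2 + \frac13 \varepsilon \right)$. Then

$$\Pr\left( \bigcup_{i=1}^T |\tilde f(\Delta_i)| > \tfrac12 \varepsilon \right) \le 2T \exp\left( -\frac{D \varepsilon^2}{16 \alpha_\varepsilon} \right).$$

A.1.4 Optimizing Over r

Combining these two bounds, we have a bound in terms of $r$:

$$\Pr\left( \sup_{\Delta \in \mathcal X_\Delta} |\tilde f(\Delta)| \le \varepsilon \right) \ge 1 - \kappa_1 r^{-d} - \kappa_2 r^2,$$

letting $\kappa_1 = 2 (4\ell)^d \exp\left( -\frac{D \varepsilon^2}{16 \alpha_\varepsilon} \right)$, $\kappa_2 = 4 \sigma_p^2 \varepsilon^{-2}$.

If we choose $r = (\kappa_1 / \kappa_2)^{1/(d+2)}$, as did Rahimi and Recht (2007), the bound again becomes $1 - 2 \kappa_1^{2/(d+2)} \kappa_2^{d/(d+2)}$. But we
could instead maximize the bound by choosing $r$ such that $d \kappa_1 r^{-d-1} - 2 \kappa_2 r = 0$, i.e. $r = \left( \frac{d \kappa_1}{2 \kappa_2} \right)^{1/(d+2)}$. Then the bound
becomes $1 - \left( \left( \frac{d}{2} \right)^{-\frac{d}{d+2}} + \left( \frac{d}{2} \right)^{\frac{2}{d+2}} \right) \kappa_1^{\frac{2}{d+2}} \kappa_2^{\frac{d}{d+2}}$:

$$\begin{aligned}
\Pr\left( \sup_{\Delta \in \mathcal X_\Delta} |\tilde f(\Delta)| > \varepsilon \right)
&\le \left( \left( \tfrac{d}{2} \right)^{-\frac{d}{d+2}} + \left( \tfrac{d}{2} \right)^{\frac{2}{d+2}} \right) \left( 2 (4\ell)^d \exp\left( -\frac{D \varepsilon^2}{16 \alpha_\varepsilon} \right) \right)^{\frac{2}{d+2}} \left( 4 \sigma_p^2 \varepsilon^{-2} \right)^{\frac{d}{d+2}} && (14) \\
&= \left( \left( \tfrac{d}{2} \right)^{-\frac{d}{d+2}} + \left( \tfrac{d}{2} \right)^{\frac{2}{d+2}} \right) 2^{\frac{6d+2}{d+2}} \left( \frac{\sigma_p \ell}{\varepsilon} \right)^{\frac{2}{1 + 2/d}} \exp\left( -\frac{D \varepsilon^2}{8 (d+2) \alpha_\varepsilon} \right). && (15)
\end{aligned}$$

For $\varepsilon \le \sigma_p \ell$, we can loosen the exponent on the middle term to 2, though in low dimensions we have a somewhat sharper
bound. We no longer need the $\ell > 1$ assumption of the original proof.

To prove the final statement of Proposition 1, simply set (14) to be at most $\delta$ and solve for $D$.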
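The optimization over $r$ and the final step of solving for $D$ can be checked numerically. In the minimal sketch below, the function names, the parameter values, and the choice $\alpha_\varepsilon = 1$ (its loosest value) are assumptions for illustration; $\sigma_p^2 = d$ corresponds to a Gaussian kernel of bandwidth 1. `required_D` simply inverts (15): $D \ge \frac{8 (d+2) \alpha_\varepsilon}{\varepsilon^2} \left[ \frac{2}{1 + 2/d} \log \frac{\sigma_p \ell}{\varepsilon} + \log \frac{c_d}{\delta} \right]$, where $c_d$ denotes the leading constant of (15).

```python
import math

def leading_constant(d):
    """((d/2)^(-d/(d+2)) + (d/2)^(2/(d+2))) * 2^((6d+2)/(d+2)), the constant in (15)."""
    half = d / 2.0
    return ((half ** (-d / (d + 2.0)) + half ** (2.0 / (d + 2.0)))
            * 2.0 ** ((6.0 * d + 2.0) / (d + 2.0)))

def kappas(D, eps, d, ell, sigma_p, alpha):
    kappa1 = 2.0 * (4.0 * ell) ** d * math.exp(-D * eps ** 2 / (16.0 * alpha))
    kappa2 = 4.0 * sigma_p ** 2 / eps ** 2
    return kappa1, kappa2

def failure_bound(r, kappa1, kappa2, d):
    """kappa_1 r^(-d) + kappa_2 r^2, the bound on Pr(sup |f| > eps) for a given r."""
    return kappa1 * r ** (-d) + kappa2 * r ** 2

def closed_form(D, eps, d, ell, sigma_p, alpha):
    """Right-hand side of (15)."""
    return (leading_constant(d) * (sigma_p * ell / eps) ** (2.0 * d / (d + 2.0))
            * math.exp(-D * eps ** 2 / (8.0 * (d + 2.0) * alpha)))

def required_D(delta, eps, d, ell, sigma_p, alpha):
    """Smallest D for which (15) is at most delta, obtained by inverting (15)."""
    log_term = ((2.0 * d / (d + 2.0)) * math.log(sigma_p * ell / eps)
                + math.log(leading_constant(d) / delta))
    return 8.0 * (d + 2.0) * alpha * log_term / eps ** 2

# Illustrative values (assumptions).
D, eps, d, ell, alpha = 200_000, 0.1, 5, 6.0, 1.0
sigma_p = math.sqrt(d)
kappa1, kappa2 = kappas(D, eps, d, ell, sigma_p, alpha)

r_rr = (kappa1 / kappa2) ** (1.0 / (d + 2))               # choice of Rahimi and Recht
r_opt = (d * kappa1 / (2.0 * kappa2)) ** (1.0 / (d + 2))  # stationary point in r
bound_rr = failure_bound(r_rr, kappa1, kappa2, d)
bound_opt = failure_bound(r_opt, kappa1, kappa2, d)
bound_closed = closed_form(D, eps, d, ell, sigma_p, alpha)
D_needed = required_D(0.01, eps, d, ell, sigma_p, alpha)
print(bound_rr, bound_opt, bound_closed, D_needed)
```

For $d = 2$ the two choices of $r$ coincide, since $d/2 = 1$; the difference only appears in other dimensions.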


A.2 PROOF OF PROPOSITION 2 


We will follow the proof strategy of Proposition 1 as closely as possible. 

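Our approximation is now $\breve s(x, y) = \breve z(x)^T \breve z(y)$, and the error is $\breve f(x, y) = \breve s(x, y) - k(x, y)$. Note that $\breve s$ and $\breve f$ are not
shift-invariant; for example, with $D = 1$, $\breve s(x, y) = \cos(\omega^T \Delta) + \cos(\omega^T (x + y) + 2b)$ but $\breve s(\Delta, 0) = \cos(\omega^T \Delta) + \cos(\omega^T \Delta + 2b)$.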
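The $D = 1$ example can be verified directly. The following minimal sketch uses scalar inputs and arbitrary illustrative values of $\omega$, $b$, $x$ and $y$ (all assumptions); it checks the product-to-sum identity behind the expansion and shows that shifting both arguments by $-y$ changes the value.

```python
import math

def s_breve_one(x, y, omega, b):
    """Single-feature (D = 1) approximation z(x) z(y), with z(x) = sqrt(2) cos(omega x + b)."""
    return 2.0 * math.cos(omega * x + b) * math.cos(omega * y + b)

omega, b = 1.3, 0.7        # arbitrary illustrative values
x, y = 0.9, 0.2
delta = x - y

value = s_breve_one(x, y, omega, b)
# 2 cos(A) cos(B) = cos(A - B) + cos(A + B)
expanded = math.cos(omega * delta) + math.cos(omega * (x + y) + 2.0 * b)
# Same difference Delta, but evaluated at (Delta, 0) instead of (x, y).
shifted = s_breve_one(delta, 0.0, omega, b)
print(value, expanded, shifted)
```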


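Let $q = \begin{bmatrix} x \\ y \end{bmatrix} \in \mathcal X^2$ denote the argument to these functions. $\mathcal X^2$ is a compact set in $\mathbb R^{2d}$ with diameter $\sqrt2 \ell$, so we can cover
it with an $\varepsilon$-net using at most $T = \left( 2 \sqrt2 \ell / r \right)^{2d}$ balls of radius $r$. Let $\{q_i\}_{i=1}^T$ denote their centers, and $L_f$ be the Lipschitz
constant of $\breve f : \mathbb R^{2d} \to \mathbb R$.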


A.2.1 Regularity Condition 

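To show $\mathbb E \nabla \breve s(q) = \nabla \mathbb E \breve s(q)$, we can define $\breve g^i_{x,y}(t, \omega)$ analogously to Appendix A.1.1, where here $\omega$ contains all the $\omega_i$
and $b_i$ variables used in $\breve z$. We then have:

$$\begin{aligned}
\mathbb E_\omega \left| \frac{\partial}{\partial t} \breve g^i_{x,y}(t, \omega) \right|
&= \mathbb E_\omega \left| \frac{2}{D} \sum_{j=1}^{D} -\omega_{ji} \cos(\omega_j^T y + b_j) \sin(\omega_j^T x + t \omega_{ji} + b_j) \right| \\
&\le \mathbb E_\omega \left[ \frac{2}{D} \sum_{j=1}^{D} |\omega_{ji}| \right] \\
&\le 2\, \mathbb E_\omega \|\omega\|,
\end{aligned}$$

which we have assumed to be finite.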

A.2.2 Lipschitz Constant 

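The argument follows that of Appendix A.1.2 up to (13), using $q^*$ in place of $\Delta^*$. Then:

$$\begin{aligned}
\mathbb E[L_f^2] &\le \mathbb E \|\nabla \breve s(q^*)\|^2 \\
&= \mathbb E \left\| \nabla_q \left( 2 \cos(\omega^T x^* + b) \cos(\omega^T y^* + b) \right) \right\|^2 \\
&= \mathbb E \left[ \left\| \nabla_x \left( 2 \cos(\omega^T x^* + b) \cos(\omega^T y^* + b) \right) \right\|^2 + \left\| \nabla_y \left( 2 \cos(\omega^T x^* + b) \cos(\omega^T y^* + b) \right) \right\|^2 \right] \\
&= \mathbb E \left[ \left\| -2 \sin(\omega^T x^* + b) \cos(\omega^T y^* + b)\, \omega \right\|^2 + \left\| -2 \cos(\omega^T x^* + b) \sin(\omega^T y^* + b)\, \omega \right\|^2 \right] \\
&= \mathbb E \left[ 4 \left( \sin^2(\omega^T x^* + b) \cos^2(\omega^T y^* + b) + \cos^2(\omega^T x^* + b) \sin^2(\omega^T y^* + b) \right) \|\omega\|^2 \right] \\
&= \mathbb E_\omega \left[ \mathbb E_b \left[ 2 - \cos(2 \omega^T (x^* - y^*)) - \cos(2 \omega^T (x^* + y^*) + 4b) \right] \|\omega\|^2 \right] \\
&= \mathbb E_\omega \left[ \left( 2 - \cos(2 \omega^T (x^* - y^*)) \right) \|\omega\|^2 \right] \\
&\le 3\, \mathbb E \|\omega\|^2 = 3 \sigma_p^2.
\end{aligned}$$

Following through with Markov's inequality:

$$\Pr\left( L_f \ge \varepsilon / (2r) \right) \le 3 \sigma_p^2 (2r / \varepsilon)^2 = 12 (\sigma_p r / \varepsilon)^2.$$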

A.2.3 Anchor Points 

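For any fixed $x$, $y$, $\breve s$ takes a mean of $D$ terms with expectation $k(x, y)$ bounded by $\pm 2$. Using Hoeffding's inequality:

$$\Pr\left( \bigcup_{i=1}^T |\breve f(q_i)| > \tfrac12 \varepsilon \right) \le T \Pr\left( |\breve f(q)| > \tfrac12 \varepsilon \right) \le 2T \exp\left( -\frac{2 D \left( \frac{\varepsilon}{2} \right)^2}{4^2} \right) = 2T \exp\left( -\frac{D \varepsilon^2}{32} \right).$$

Since the variance of each term is given by (6), we can instead use Bernstein's inequality:

$$T \Pr\left( |\breve f(q)| > \tfrac12 \varepsilon \right) \le 2T \exp\left( -\frac{D \frac{\varepsilon^2}{4}}{2 \left( \operatorname{Var}[\cos(\omega^T \Delta)] + \frac12 \right) + \frac43 \varepsilon} \right) = 2T \exp\left( -\frac{D \varepsilon^2}{4 + 8 \operatorname{Var}[\cos(\omega^T \Delta)] + \frac{16}{3} \varepsilon} \right).$$

Thus Bernstein's gives us a tighter bound if

$$4 + 8 \operatorname{Var}[\cos(\omega^T \Delta)] + \tfrac{16}{3} \varepsilon < 32, \quad \text{i.e.} \quad 2 \operatorname{Var}[\cos(\omega^T \Delta)] + \tfrac43 \varepsilon < 7.$$

To unify the bounds, define $\alpha'_\varepsilon := \min\left( 1, \max_\Delta \frac18 + \frac14 \operatorname{Var}[\cos(\omega^T \Delta)] + \frac16 \varepsilon \right)$; then

$$\Pr\left( \bigcup_{i=1}^T |\breve f(q_i)| > \tfrac12 \varepsilon \right) \le 2T \exp\left( -\frac{D \varepsilon^2}{32 \alpha'_\varepsilon} \right).$$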


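A.2.4 Optimizing Over r

Our bound is now of the form

$$\Pr\left( \sup_{q \in \mathcal X^2} |\breve f(q)| \le \varepsilon \right) \ge 1 - \kappa_1 r^{-2d} - \kappa_2 r^2,$$

with $\kappa_1 = 2 \left( 2 \sqrt2 \ell \right)^{2d} \exp\left( -\frac{D \varepsilon^2}{32 \alpha'_\varepsilon} \right)$ and $\kappa_2 = 12 \sigma_p^2 \varepsilon^{-2}$.

This is maximized by $r$ when $2d \kappa_1 r^{-2d-1} - 2 \kappa_2 r = 0$, i.e. $r = \left( \frac{d \kappa_1}{\kappa_2} \right)^{\frac{1}{2d+2}}$. Substituting that
value of $r$ into the bound yields $1 - \left( d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}} \right) \kappa_1^{\frac{1}{d+1}} \kappa_2^{\frac{d}{d+1}}$, and thus:

$$\begin{aligned}
\Pr\left( \sup_{q \in \mathcal X^2} |\breve f(q)| > \varepsilon \right)
&\le \left( d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}} \right) \left( 2 \left( 2 \sqrt2 \ell \right)^{2d} \exp\left( -\frac{D \varepsilon^2}{32 \alpha'_\varepsilon} \right) \right)^{\frac{1}{d+1}} \left( 12 \sigma_p^2 \varepsilon^{-2} \right)^{\frac{d}{d+1}} \\
&= \left( d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}} \right) 2^{\frac{1 + 2d + d + 2d}{d+1}}\, 3^{\frac{d}{d+1}} \left( \frac{\sigma_p \ell}{\varepsilon} \right)^{\frac{2d}{d+1}} \exp\left( -\frac{D \varepsilon^2}{32 (d+1) \alpha'_\varepsilon} \right) \\
&= \left( d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}} \right) 2^{\frac{5d+1}{d+1}}\, 3^{\frac{d}{d+1}} \left( \frac{\sigma_p \ell}{\varepsilon} \right)^{\frac{2}{1 + 1/d}} \exp\left( -\frac{D \varepsilon^2}{32 (d+1) \alpha'_\varepsilon} \right). && (16)
\end{aligned}$$

As before, when $\varepsilon \le \sigma_p \ell$ we can loosen the exponent on the middle term to 2; it is slightly worse than the corresponding
exponent of (15) for small $d$.

To prove the final statement of Proposition 2, set (16) to be at most $\delta$ and solve for $D$.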
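For a side-by-side view of the two uniform bounds, the following minimal sketch evaluates the right-hand sides of (15) and (16). The function names, the parameter values, and the choice of 1 for both variance factors (their loosest value) are assumptions for illustration.

```python
import math

def constant_tilde(d):
    """Leading constant of (15)."""
    half = d / 2.0
    return ((half ** (-d / (d + 2.0)) + half ** (2.0 / (d + 2.0)))
            * 2.0 ** ((6.0 * d + 2.0) / (d + 2.0)))

def constant_breve(d):
    """Leading constant of (16)."""
    return ((d ** (-d / (d + 1.0)) + d ** (1.0 / (d + 1.0)))
            * 2.0 ** ((5.0 * d + 1.0) / (d + 1.0)) * 3.0 ** (d / (d + 1.0)))

def bound_tilde(D, eps, d, ell, sigma_p, alpha=1.0):
    """Right-hand side of (15), for the sin/cos embedding."""
    return (constant_tilde(d) * (sigma_p * ell / eps) ** (2.0 / (1.0 + 2.0 / d))
            * math.exp(-D * eps ** 2 / (8.0 * (d + 2.0) * alpha)))

def bound_breve(D, eps, d, ell, sigma_p, alpha=1.0):
    """Right-hand side of (16), for the random-phase embedding."""
    return (constant_breve(d) * (sigma_p * ell / eps) ** (2.0 / (1.0 + 1.0 / d))
            * math.exp(-D * eps ** 2 / (32.0 * (d + 1.0) * alpha)))

# Illustrative values (assumptions): Gaussian kernel of bandwidth 1 in d = 2 dimensions.
D, eps, d, ell = 100_000, 0.1, 2, 6.0
sigma_p = math.sqrt(d)
b_tilde = bound_tilde(D, eps, d, ell, sigma_p)
b_breve = bound_breve(D, eps, d, ell, sigma_p)
print(b_tilde, b_breve)
for dim in range(1, 11):
    print(dim, round(constant_tilde(dim), 2), round(constant_breve(dim), 2))
```

With equal variance factors, the exponential rate in (15) is faster than that of (16) by a factor of $4 (d+1) / (d+2)$, in addition to the smaller leading constant.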


A.3 PROOF OF PROPOSITION 3 


Consider the $\tilde z$ features, and recall that we supposed $k$ is $L$-Lipschitz over $\mathcal X_\Delta := \{x - y \mid x, y \in \mathcal X\}$.

Our primary tool will be the following slight generalization of Dudley's entropy integral, which is a special case of
Lemma 13.1 of Boucheron et al. (2013). (The only difference from their Corollary 13.2 is that we maintain the variance
factor $v$.)

Theorem (Boucheron et al. 2013). Let $\mathcal T$ be a finite pseudometric space and let $(X_t)_{t \in \mathcal T}$ be a collection of random
variables such that for some constant $v > 0$,

$$\log \mathbb E\, e^{\lambda (X_t - X_{t'})} \le \tfrac12 v \lambda^2 d^2(t, t')$$

for all $t, t' \in \mathcal T$ and all $\lambda > 0$. Let $\delta = \sup_{t \in \mathcal T} d(t, t_0)$. Then, for any $t_0 \in \mathcal T$,

$$\mathbb E\left[ \sup_{t \in \mathcal T} X_t - X_{t_0} \right] \le 12 \sqrt v \int_0^{\delta/2} \sqrt{H(u, \mathcal T)}\, du.$$

Note that, although stated for finite pseudometric spaces, the result is extensible to separable pseudometric spaces (such as
$\mathcal X_\Delta$) by standard arguments. Here $H(\delta, \mathcal T) = \log N(\delta, \mathcal T)$, where $N$ is the $\delta$-packing number, is known as the $\delta$-entropy
number. It is the case that the $\delta$-packing number is at most the $\frac{\delta}{2}$-covering number, which Proposition 5 of Cucker and
Smale (2001) bounds. Thus, picking $\Delta_0 = 0$ gives $\delta = \ell$, $H(\delta, \mathcal X_\Delta) \le d \log(8\ell / \delta)$, and

$$\int_0^{\ell/2} \sqrt{H(u, \mathcal X_\Delta)}\, du \le \int_0^{\ell/2} \sqrt{d \log(8\ell / u)}\, du = \gamma \ell \sqrt d,$$

where $\gamma := 4 \sqrt\pi \operatorname{erfc}\left( 2 \sqrt{\log 2} \right) + \sqrt{\log 2} \approx 0.964$.
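The value of the constant can be checked numerically. After the substitution $u = \ell v$, the entropy integral becomes $\ell \sqrt d \int_0^{1/2} \sqrt{\log(8 / v)}\, dv$, so $\gamma$ is that last integral. The minimal sketch below (function names and the number of quadrature points are assumptions) compares a midpoint-rule approximation with the closed form $4 \sqrt\pi \operatorname{erfc}(2 \sqrt{\log 2}) + \sqrt{\log 2}$.

```python
import math

def gamma_closed_form():
    """4 sqrt(pi) erfc(2 sqrt(log 2)) + sqrt(log 2)."""
    root = math.sqrt(math.log(2.0))
    return 4.0 * math.sqrt(math.pi) * math.erfc(2.0 * root) + root

def gamma_numeric(n=200_000):
    """Midpoint rule for the integral of sqrt(log(8 / v)) over v in (0, 1/2)."""
    h = 0.5 / n
    return h * sum(math.sqrt(math.log(8.0 / ((i + 0.5) * h))) for i in range(n))

closed = gamma_closed_form()
numeric = gamma_numeric()
print(closed, numeric)
```

The integrand has an integrable singularity at 0, which the midpoint rule handles without evaluating the endpoint.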

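Now, $\frac{2}{D} \left( \cos(\omega_i^T \Delta) - k(\Delta) - \cos(\omega_i^T \Delta') + k(\Delta') \right)$ has mean zero, and absolute value bounded by

$$\begin{aligned}
\left| \frac{2}{D} \left( \cos(\omega_i^T \Delta) - k(\Delta) - \cos(\omega_i^T \Delta') + k(\Delta') \right) \right|
&\le \frac{2}{D} \left( \left| \cos(\omega_i^T \Delta) - \cos(\omega_i^T \Delta') \right| + \left| k(\Delta) - k(\Delta') \right| \right) \\
&\le \frac{2}{D} \left( \left| \omega_i^T \Delta - \omega_i^T \Delta' \right| + L \|\Delta - \Delta'\| \right) \\
&\le \frac{2}{D} \left( \|\omega_i\| + L \right) \|\Delta - \Delta'\|. && (17)
\end{aligned}$$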

Thus, via Hoeffding’s lemma (Boucheron et al. 2013, Lemma 2.2), each such term has log moment generating function at 
most J 2 (||wj + T)^A^||A - A'lp. 

This is almost in the form required by Dudley's entropy integral, except that $\omega_i$ is a random variable. Thus, for any $r > 0$,
define the random process $\tilde g_r$ which is distributed as $\tilde f$ except we require that $\|\omega_1\| = r$ and $\|\omega_i\| \le r$ for all $i > 1$. Since
log mgfs of independent variables are additive, we thus have


— (cos(a;7A) — k{A) — cos(a;7A') -b fc(A')) 


D/2 

logEe^^Sr(A)-SAA')) < A I ^ I ^ 2 |,^ _ ^,||2 <A{r + LfX^' 


A-A'll 


2=1 


Pr satisfies the conditions of the theorem with v = jy{r + Lf^. Now, ^^(0) = 0, so we have 


E 


sup Pr(A) 

.AeA'A 


< 


l2-fs/d£ 


{r + L). 
But the distribution of $\tilde f$ conditioned on the event $\max_i \|\omega_i\| = r$ is the same as the distribution of $\tilde g_r$. Thus


Esup/ = Ef [E[supgr]] < E^. 


127 V^f 


(r+ L) 


12 -/^/di 


(i? + L) 


where $R := \mathbb E \max_{i = 1, \dots, D/2} \|\omega_i\|$.


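The same holds for $\mathbb E \sup(-\tilde f)$. Since we have $\sup \tilde f \ge 0$, $\sup(-\tilde f) \ge 0$, the claim follows from

$$\mathbb E \|\tilde f\|_\infty = \mathbb E \left[ \max\left( \sup \tilde f, \sup(-\tilde f) \right) \right] \le \mathbb E \left[ \sup \tilde f + \sup(-\tilde f) \right].$$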


A.4 PROOF OF PROPOSITION 4 


For the $\breve z$ features, the error process again must be defined over $\mathcal X^2$ due to the non-shift-invariant noise. We still assume
that $k$ is $L$-Lipschitz over $\mathcal X_\Delta$, however.

Compared to the argument of Appendix A.3, we have $H(u, \mathcal X^2) \le 2d \log\left( 4 \sqrt2 \ell / u \right)$. Unlike $\mathcal X_\Delta$, however, $\mathcal X^2$ does not
necessarily contain an obvious point $q_0$ to minimize $\sup_{q \in \mathcal X^2} d(q, q_0)$, nor an obvious minimal value. We rather consider
the "radius" $\rho := \sup_{x \in \mathcal X} d(x, x_0)$, achieved by any convenient point $x_0$; then $\sup_{q \in \mathcal X^2} d(q, (x_0, x_0)) = \sqrt2 \rho$. Note that
$\frac12 \ell \le \rho \le \ell$, where the lower bound is achieved by $\mathcal X$ a ball, and the upper bound by $\mathcal X$ a sphere. The integral in the bound
is then

$$\int_0^{\rho / \sqrt2} \sqrt{H(u, \mathcal X^2)}\, du \le \int_0^{\rho / \sqrt2} \sqrt{2d \log\left( 4 \sqrt2 \ell / u \right)}\, du$$


p/%/2 


= erfc ^ log 2 -I- log4v^^ j -I- p'/d^J | log 2 -f log | 

= (^4V^erfc (^^ilog2 + log4y2^^ + f log 2 + log fv^. (18) 

Calling the term in parentheses $\gamma'_{\ell/\rho}$, we have that $\gamma'_1 \approx 1.541$, $\gamma'_2 \approx 0.803$, and it decreases monotonically in between, as
shown in Figure 7.

Figure 7: The coefficient of (18) as a function of $\ell / \rho$.


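We will again use the notation of $q = (x, y) \in \mathcal X^2$, $\Delta = x - y$, $t = x + y$. Each term in the sum of $\breve f(q) - \breve f(q')$ has mean
zero and absolute value at most

$$\begin{aligned}
&\frac{1}{D} \left| \cos(\omega_i^T \Delta) + \cos(\omega_i^T t + 2 b_i) - k(\Delta) - \cos(\omega_i^T \Delta') - \cos(\omega_i^T t' + 2 b_i) + k(\Delta') \right| \\
&\quad \le \frac{1}{D} \left( \left| \cos(\omega_i^T \Delta) - \cos(\omega_i^T \Delta') \right| + \left| \cos(\omega_i^T t + 2 b_i) - \cos(\omega_i^T t' + 2 b_i) \right| + \left| k(\Delta) - k(\Delta') \right| \right) \\
&\quad \le \frac{1}{D} \left( \|\omega_i\| \|\Delta - \Delta'\| + \|\omega_i\| \|t - t'\| + L \|\Delta - \Delta'\| \right).
\end{aligned}$$

Now, in order to cast this in terms of distance on $\mathcal X^2$, let $\delta_x = x - x'$, $\delta_y = y - y'$. Then

$$\begin{aligned}
\|q - q'\|^2 &= \|\delta_x\|^2 + \|\delta_y\|^2 \\
\left( \|\Delta - \Delta'\| + \|t - t'\| \right)^2 &= \left( \sqrt{\|\delta_x\|^2 + \|\delta_y\|^2 - 2 \delta_x^T \delta_y} + \sqrt{\|\delta_x\|^2 + \|\delta_y\|^2 + 2 \delta_x^T \delta_y} \right)^2 \\
&= 2 \|\delta_x\|^2 + 2 \|\delta_y\|^2 + 2 \sqrt{\left( \|\delta_x\|^2 + \|\delta_y\|^2 \right)^2 - 4 \left( \delta_x^T \delta_y \right)^2} \\
&\le 4 \left( \|\delta_x\|^2 + \|\delta_y\|^2 \right) \\
\|\Delta - \Delta'\| + \|t - t'\| &\le 2 \|q - q'\| \\
\|\Delta - \Delta'\| &\le 2 \|q - q'\|
\end{aligned}$$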
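The distance inequality is easy to sanity-check numerically. The following minimal sketch (helper names, dimension, and sampling range are assumptions) draws random pairs $q = (x, y)$ and $q' = (x', y')$ and records the largest observed value of $\left( \|\Delta - \Delta'\| + \|t - t'\| \right) / \|q - q'\|$, which should never exceed 2.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def sub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

rng = random.Random(2)
d = 4                                  # illustrative dimension (assumption)
worst_ratio = 0.0
for _ in range(1000):
    x, y, x2, y2 = ([rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(4))
    delta_gap = norm(sub(sub(x, y), sub(x2, y2)))   # ||Delta - Delta'||
    t_gap = norm(sub(add(x, y), add(x2, y2)))       # ||t - t'||
    q_gap = norm(sub(x, x2) + sub(y, y2))           # ||q - q'||, with q = (x, y) stacked
    worst_ratio = max(worst_ratio, (delta_gap + t_gap) / q_gap)
print(worst_ratio)
```

The ratio equals 2 exactly when $\delta_x$ and $\delta_y$ are orthogonal, since then the square root term in the expansion reaches its maximum.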


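and so each term in the sum of $\breve f(q) - \breve f(q')$ has absolute value at most $\frac{2}{D} \left( \|\omega_i\| + L \right) \|q - q'\|$. Note that this agrees exactly
with (17), but the sum in $\breve f(q) - \breve f(q')$ has $D$ terms rather than $D/2$. Defining $\breve g_r$ analogously to $\tilde g_r$, we thus get that

$$\log \mathbb E\, e^{\lambda (\breve g_r(q) - \breve g_r(q'))} \le \sum_{i=1}^{D} \frac{2}{D^2} \left( \|\omega_i\| + L \right)^2 \lambda^2 \|q - q'\|^2 \le \frac{2}{D} (r + L)^2 \lambda^2 \|q - q'\|^2,$$

and the conditions of the theorem hold with $v = \frac{4}{D} (r + L)^2$. Note that $\mathbb E\, \breve g_r(q_0) = 0$. Carrying out the rest of the argument,
we get that

$$\mathbb E \sup \breve f = \mathbb E_r \left[ \mathbb E \left[ \sup \breve g_r \right] \right] \le \mathbb E_r \left[ \frac{24 \gamma'_{\ell/\rho} \ell \sqrt d}{\sqrt D} (r + L) \right] = \frac{24 \gamma'_{\ell/\rho} \ell \sqrt d}{\sqrt D} (R + L),$$

and similarly for $\mathbb E \sup(-\breve f)$. We do not have a guarantee that $\breve f(q)$ does not have a consistent sign, and so our bound becomes

$$\begin{aligned}
\mathbb E \|\breve f\|_\infty &\le \mathbb E \left[ \|\breve f\|_\infty \mid \breve f \text{ crosses } 0 \right] \Pr\left( \breve f \text{ crosses } 0 \right) + 3 \Pr\left( \breve f \text{ does not cross } 0 \right) \\
&\le \frac{48 \gamma'_{\ell/\rho} \ell \sqrt d}{\sqrt D} (R + L) \Pr\left( \breve f \text{ crosses } 0 \right) + 3 \Pr\left( \breve f \text{ does not cross } 0 \right).
\end{aligned}$$

$\Pr(\breve f \text{ crosses } 0)$ is extremely close to 1 in "usual" situations.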


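B PROOFS FOR $L_2$ ERROR BOUND (SECTION 2.3)


B.1 BOUNDED CHANGE IN $\|\tilde f\|_\mu^2$


Viewing $\|\tilde f\|_\mu^2$ as a function of $\omega_1, \dots, \omega_{D/2}$, we can bound the change due to replacing $\omega_1$ by $\hat\omega_1$. (By symmetry, the
change is the same for any $\omega_i$.)

$$\begin{aligned}
&\left| \|\tilde f\|_\mu^2(\hat\omega_1, \omega_2, \dots, \omega_{D/2}) - \|\tilde f\|_\mu^2(\omega_1, \omega_2, \dots, \omega_{D/2}) \right| \\
&= \Bigg| \int_{\mathcal X^2} \left( \frac{2}{D} \cos(\hat\omega_1^T (x - y)) + \frac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \right)^2 d\mu(x, y) \\
&\qquad - \int_{\mathcal X^2} \left( \frac{2}{D} \cos(\omega_1^T (x - y)) + \frac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \right)^2 d\mu(x, y) \Bigg| \\
&= \Bigg| \int_{\mathcal X^2} \frac{4}{D^2} \cos^2(\hat\omega_1^T (x - y))\, d\mu(x, y) + \int_{\mathcal X^2} \left( \frac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \right)^2 d\mu(x, y) \\
&\qquad + 2 \int_{\mathcal X^2} \frac{2}{D} \cos(\hat\omega_1^T (x - y)) \left( \frac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \right) d\mu(x, y) \\
&\qquad - \frac{4}{D^2} \int_{\mathcal X^2} \cos^2(\omega_1^T (x - y))\, d\mu(x, y) - \int_{\mathcal X^2} \left( \frac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \right)^2 d\mu(x, y) \\
&\qquad - 2 \int_{\mathcal X^2} \frac{2}{D} \cos(\omega_1^T (x - y)) \left( \frac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \right) d\mu(x, y) \Bigg| \\
&= \Bigg| \frac{4}{D^2} \int_{\mathcal X^2} \left( \cos^2(\hat\omega_1^T (x - y)) - \cos^2(\omega_1^T (x - y)) \right) d\mu(x, y) \\
&\qquad + \frac{4}{D} \int_{\mathcal X^2} \left( \cos(\hat\omega_1^T (x - y)) - \cos(\omega_1^T (x - y)) \right) \left( \frac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \right) d\mu(x, y) \Bigg| \\
&\le \frac{4}{D^2} \int_{\mathcal X^2} d\mu(x, y) + \frac{4}{D} \int_{\mathcal X^2} 4\, d\mu(x, y) \\
&= \left( \frac{4}{D^2} + \frac{16}{D} \right) \mu(\mathcal X^2) = \frac{16 D + 4}{D^2}\, \mu(\mathcal X^2).
\end{aligned}$$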
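The bounded-difference property can be checked on a toy problem. In the minimal sketch below, $\mu$ is taken to be the uniform probability measure on a finite set of scalar input pairs (so $\mu(\mathcal X^2) = 1$), the kernel is Gaussian with bandwidth 1, and all names and sizes are assumptions for illustration. It resamples the first frequency many times and records the largest observed change in $\|\tilde f\|_\mu^2$, which should stay below $(16 D + 4) / D^2$.

```python
import math
import random

def sq_error_mu(omegas, pairs, sigma=1.0):
    """||f||_mu^2 for mu uniform on `pairs` (total mass 1), Gaussian kernel, scalar inputs."""
    D = 2 * len(omegas)
    total = 0.0
    for x, y in pairs:
        s = (2.0 / D) * sum(math.cos(w * (x - y)) for w in omegas)
        k = math.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))
        total += (s - k) ** 2
    return total / len(pairs)

rng = random.Random(3)
half_D = 10                            # illustrative size (assumption)
D = 2 * half_D
pairs = [(rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)) for _ in range(200)]
omegas = [rng.gauss(0.0, 1.0) for _ in range(half_D)]

bound = (16.0 * D + 4.0) / D ** 2      # with mu(X^2) = 1
baseline = sq_error_mu(omegas, pairs)
largest_change = 0.0
for _ in range(200):
    swapped = [rng.gauss(0.0, 1.0)] + omegas[1:]   # replace only the first frequency
    largest_change = max(largest_change, abs(sq_error_mu(swapped, pairs) - baseline))
print(largest_change, bound)
```

The observed changes are typically far smaller than the bound, which is a worst-case statement over all possible frequency values.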


B.2 BOUNDED CHANGE IN $\|\breve f\|_\mu^2$

We can do essentially the same thing for $\breve f$:

\{uJi,UJ2, ■ ■ ■ , WD/ 2 ) — ll/ll^(‘^l)‘^2) • ■ • ,^ 0 / 2 ) 

j 2 (cos(w7 {x - y)) + cos{ujJ{x + y) + 2bi)) 

2 

+ —^ [cos(w7(a;-y)) + cos(a;7(x + j/) + 2&i)] -k{x,y) j dy.{x,y) 

” / 2 (cos(w[(a; - y)) + cos(w7 {x + y) + 26,)) 

2 V 

+ —^ [cos(w7(a;-y)) + cos(a;7(x-y) + 26i)] -k{x,y) d^i(x,y) 
i=2 j 

J (^{cos{ojJ{x - y)) + cos(ujJ(x + y) + 26^))^ - (cos(w7(a; - y)) + cos(ujJ(x + y) + 2bi))^J dy(a 

+ ^ / (cos(w)'^(a; - y)) + cos(ujJ(x + y) + 2bi) - cos(wJ^(a; - y)) - cos(a))'^(x + y) + 26,)) 

^ Jx^ 

^ 51 “ y)) + cos{u}J{x + y) + 2bi)] - k{x, y) j dy{x, y) 


<y) 


< 


D2 


1X2 


8dy{x,y) + ^[ 8dy{x,y) 
^ JX2 


32 32 

“ ) 